Core Series Operations  郝伟 2021/03/04

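1. Introduction

Pandas is a powerful library for working with two-dimensional tabular data. Its two-dimensional structure, the DataFrame, is made up of multiple one-dimensional Series objects. A Series has the following components:

Data: a scalar value or a collection such as a list, dict, or NumPy array.
index: the labels for the data. Labels must be hashable and are usually unique; when the data is a collection, the index must have the same length as the data.
dtype: the data type of the Series.

This article introduces the basic usage of Series.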

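2. 1 Creation

2.1. 1.1 Creating from a dictionary

The example below shows that:

A Series object can be created directly from a dictionary.
The keys become the index of the Series and the values become its data.
The values may have different types (unlike a typical NumPy array); the dtype then becomes object.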

import numpy as np
import pandas as pd
data = {
  '1':1,
  '2':2,
  '3':3,
  '4':'hello',
  '5':'python',
  'list':[1,2]
}
s1 = pd.Series(data)
print(s1, type(s1))

# Output
1            1
2            2
3            3
4        hello
5       python
list    [1, 2]
dtype: object <class 'pandas.core.series.Series'>
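
As a small complement to the points above, the sketch below shows two related behaviours of dictionary-based creation: the dtype is inferred from the values, and an explicit index= argument selects and reorders keys. The dictionary contents and variable names are made up for illustration.

```python
import pandas as pd

prices = {'apple': 3, 'banana': 2, 'cherry': 5}   # hypothetical sample data

# All values are integers, so pandas infers an integer dtype instead of object.
s_int = pd.Series(prices)
print(s_int.dtype)

# Passing index= selects and reorders keys; a key missing from the dict becomes NaN,
# which also forces the dtype to float.
s_sel = pd.Series(prices, index=['cherry', 'apple', 'durian'])
print(s_sel)
```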

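2.2. 1.2 Creating from a NumPy array

Data is the core of a Series, so it is the first parameter of pd.Series(). The example below uses np.random.rand(5) to generate an ndarray of length 5 and initializes a Series with it. index supplies the labels and must have the same length as the data. name is the name of the Series and is shown when it is printed. Neither index nor name is required.

import numpy as np
import pandas as pd
# The three arguments are the data, the index, and the name of the Series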
s = pd.Series(np.random.rand(5), index = list('abcde'), name = 'test')
print(s,type(s))

# Output
a    0.478839
b    0.517298
c    0.854202
d    0.543885
e    0.032623
Name: test, dtype: float64 <class 'pandas.core.series.Series'>

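2.3. 1.3 Creating from a scalar

A scalar is a single constant value. As shown below, the number 3 can be used to create a Series of length 5. The length is determined by the index; because the data is just the scalar 3, it is repeated 5 times to fill the Series.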

import numpy as np
import pandas as pd

s = pd.Series(3,index=list('abcde'))
print(s)

# Output
a    3
b    3
c    3
d    3
e    3
dtype: int64

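3. 2 Data access

Data access covers the ways of reading the data in a Series object. Because a Series is one-dimensional, its data is normally accessed either by index label or by position.

3.1. 2.1 Access by position

Access by position is the most common approach: the Series can be used like an array and indexed with a subscript that starts at 0.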

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(5))
print(s,'\n')
print('s[2]:', s[2],type(s[2]),s[2].dtype)

# Output
0    0.949404
1    0.400692
2    0.660859
3    0.295815
4    0.680184
dtype: float64 

s[2]: 0.6608588265235231 <class 'numpy.float64'> float64
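
One caveat worth illustrating: with the default integer index, s[2] happens to agree with the position, but when the index itself contains integers, s[...] looks up the label rather than the position. The sketch below uses a made-up Series to show the difference; .iloc is the explicit positional accessor.

```python
import pandas as pd

# Hypothetical Series whose integer labels are in reverse order.
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s[2]        # looks up the label 2, which is the first element
by_pos = s.iloc[2]     # looks up position 2, which is the last element

print(by_label, by_pos)
```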

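3.2. 2.2 Access by index label

Access by index label uses the index of the Series to look up the corresponding data. You can think of the Series as a dictionary in which a key retrieves a value. The lookup is more powerful than a dictionary's, though: besides a single key, you can pass a list of keys and get several values at once.

Note that accessing a single key that does not exist raises a KeyError. With a list of keys, older pandas versions returned NaN for missing labels and emitted a FutureWarning (as in the output below); pandas 1.0 and later raise a KeyError instead.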

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(5), index = list('abcde'))
print(s)
print('-'*10, '\n')
print("s['a']:", s['a'], '\n')
print("--- s[['b','e', 'f']] ---")
print(s[['b','e', 'f']])

# Output
a    0.977675
b    0.128278
c    0.110421
d    0.413023
e    0.568087
dtype: float64
---------- 

s['a']: 0.9776748201255117 

--- s[['b','e', 'f']] ---
b    0.128278
e    0.568087
f         NaN
dtype: float64
# 'f' is not in the index; older pandas versions warn the first time but still run
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py:1152: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self.loc[key]
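
The warning above recommends .reindex() as the alternative. The following is a minimal sketch, using made-up data, of three ways to handle labels that may be missing without triggering an error: .get() for a single key, .reindex() to keep the missing label as NaN, and an isin() mask to keep only the labels that exist.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=list('abc'))   # hypothetical data
wanted = ['b', 'c', 'f']                             # 'f' is not in the index

# Single key: .get() returns a default instead of raising KeyError.
missing_single = s.get('f', default=None)

# List of keys: .reindex() keeps 'f' and fills it with NaN.
picked = s.reindex(wanted)

# List of keys: keep only the labels that actually exist.
present = s[s.index.isin(wanted)]

print(missing_single)
print(picked)
print(present)
```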

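3.3. 2.3 Access by slicing

Like a list, a Series supports slicing. A slice can also be given as a range of index labels, in which case it follows the order of the index and includes the end label.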

import numpy as np
import pandas as pd

s1 = pd.Series(np.random.rand(5),list('abcde'))
print(s1,'\n')
print(s1['a':'c'],'\n')      # slicing by index label includes the end label
print(s1[0:2],'\n')          # slicing by position works like a list slice and excludes the end

# Output
a    0.634454
b    0.132619
c    0.211219
d    0.559798
e    0.424643
dtype: float64 

a    0.634454
b    0.132619
c    0.211219
dtype: float64 

a    0.634454
b    0.132619
dtype: float64
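
The same two kinds of slice can be written more explicitly with .loc (labels, end included) and .iloc (positions, end excluded). This is a small sketch with made-up values, not part of the original example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))   # hypothetical data

by_label = s.loc['b':'d']   # labels b, c, d: the end label is included
by_pos = s.iloc[1:3]        # positions 1 and 2: the end position is excluded

print(by_label)
print(by_pos)
```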

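3.4. 2.4 Boolean access

A comparison on a Series produces a new Series of boolean values, which can in turn be used to select data. The functions .isnull() and .notnull() test for missing values: None represents an empty value and NaN represents "not a number", and both are treated as missing.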

import numpy as np
import pandas as pd

s = pd.Series([0.2, 0.5, None])
print(s,'\n')     
print(s > 50,'\n')
print(s.isnull(), '\n')
print(s.notnull(), '\n')
print(s[s > 50])

# Output
0    0.2
1    0.5
2    NaN
dtype: float64 

0    False
1    False
2    False
dtype: bool 

0    False
1    False
2     True
dtype: bool 

0     True
1     True
2    False
dtype: bool 

Series([], dtype: float64)
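
In the example above the condition s > 50 matches nothing, so the result is empty. The sketch below, with made-up values, shows a mask that does select data, how to combine two conditions with & (each condition needs its own parentheses), and how to drop missing values with .notnull().

```python
import pandas as pd

s = pd.Series([0.2, 0.5, None, 0.9])   # hypothetical data with one missing value

# Combine conditions with & (and) or | (or); comparisons against NaN are False.
mask = (s > 0.3) & (s < 0.8)
selected = s[mask]

# Keep only the non-missing values.
clean = s[s.notnull()]

print(selected)
print(clean)
```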

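4. 3 Index operations

Besides operations on the data, operations on the index are also important. Some common ones are shown below.

4.1. Grouping by data

4.2. 3.1 Index attributes

Besides the data, we can also access the index itself and its type. The code below shows three common index types: RangeIndex for a range of integers, Index for arbitrary labels, and DatetimeIndex for a range of dates.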

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(5), index=range(5))
print('type(s.index): ', type(s.index), '\n')
print('s.index:', s.index, '\n')

s = pd.Series(np.random.rand(5), index=list('abcde'))
print('type(s.index): ', type(s.index), '\n')
print('s.index:', s.index, '\n')

s = pd.Series(np.random.rand(5), index=pd.date_range('2018-01-01', periods=5))
print('type(s.index): ', type(s.index), '\n')
print('s.index:', s.index, '\n')


# Output
type(s.index):  <class 'pandas.core.indexes.range.RangeIndex'> 

s.index: RangeIndex(start=0, stop=5, step=1) 

type(s.index):  <class 'pandas.core.indexes.base.Index'> 

s.index: Index(['a', 'b', 'c', 'd', 'e'], dtype='object') 

type(s.index):  <class 'pandas.core.indexes.datetimes.DatetimeIndex'> 

s.index: DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05'],
              dtype='datetime64[ns]', freq='D')

4.3. 3.2 Accessing the index

Because an index is an ordered sequence, it can be accessed by position and by slice in the same way as a list.

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(5), index=range(5))

print('type(s)', type(s), '\n')

print(s, '\n')

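# Show the index
print(s.index, '\n')

# Show the part of the index in the range [1, 3)
print(s.index[1:3], '\n')

# The selected part of the index can be used directly to read data from s
print(s[s.index[1:3]])

# Iterate over the index and print the value for each label
for id in s.index:
    print(s[id])

# Output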
type(s) <class 'pandas.core.series.Series'> 

0    0.138492
1    0.285440
2    0.280471
3    0.245737
4    0.996996
dtype: float64 

RangeIndex(start=0, stop=5, step=1) 

RangeIndex(start=1, stop=3, step=1)

1    0.285440
2    0.280471
dtype: float64

0.13849193313381447
0.2854401610542934
0.280470887359729
0.2457365359030208
0.996996040313859

5. 4 Basic operations

5.1. 4.1 Adding data

import numpy as np
import pandas as pd

s1 = pd.Series(np.random.rand(2))
print('s1')
print(s1)

# Method 1: add new data directly by index label
s1[3]= 100      # add the value 100 at index 3
s1['a'] = 200    # add the value 200 at index 'a'
print('\ns1 after adding two values')
print(s1,'\n')

# Method 2: combine with another Series
s2 = pd.Series(np.random.rand(2), index = ['value1','value2'])
print('\ns2')
print(s2)
s3 = pd.concat([s1, s2])        # replaces s1.append(s2); Series.append() was removed in pandas 2.0
print('\ns1.append(s2)')
print(s3)

# Output
s1
0    0.981331
1    0.555244
dtype: float64

s1 after adding two values
0      0.981331
1      0.555244
3    100.000000
a    200.000000
dtype: float64 


s2
value1    0.089712
value2    0.399171
dtype: float64

s1.append(s2)
0           0.981331
1           0.555244
3         100.000000
a         200.000000
value1      0.089712
value2      0.399171
dtype: float64
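
Series.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the usual way to join two Series is pd.concat(). The sketch below uses made-up values and also shows ignore_index=True, which discards the original labels and renumbers the result.

```python
import pandas as pd

a = pd.Series([1.0, 2.0])                      # hypothetical data, index 0 and 1
b = pd.Series([3.0, 4.0], index=['x', 'y'])    # hypothetical data, string labels

# Keep the original labels of both Series.
joined = pd.concat([a, b])

# Discard the labels and renumber from 0.
renumbered = pd.concat([a, b], ignore_index=True)

print(joined)
print(renumbered)
```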

5.2. 4.2 Deleting data

import numpy as np
import pandas as pd
s = pd.Series(np.random.rand(5),index = list('abcde'))
print('s')
print(s)


del s['a']           # delete in place with del
print("\nDelete one item: del s['a']")
print(s,'\n')

s1 = s.drop(['c','d'])           # delete with .drop(), which returns a new Series; pass a list to delete several labels
print("\nDelete several items: s1 = s.drop(['c','d']) ")
print(s1)

# Output
s
a    0.687421
b    0.938094
c    0.391408
d    0.667542
e    0.245056
dtype: float64

Delete one item: del s['a']
b    0.938094
c    0.391408
d    0.667542
e    0.245056
dtype: float64 


Delete several items: s1 = s.drop(['c','d']) 
b    0.938094
e    0.245056
dtype: float64
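
Two details of deletion may be worth a quick sketch: .drop() leaves the original Series untouched and can be told to skip labels that do not exist, while .pop() removes a label in place and returns its value. The data below is made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=list('abcd'))   # hypothetical data

# .drop() returns a new Series; errors='ignore' skips the missing label 'z'.
s_new = s.drop(['b', 'z'], errors='ignore')

# .pop() removes the label from s itself and returns the removed value.
removed = s.pop('a')

print(s_new)
print(removed)
print(s)
```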

5.3. 4.3 Modifying data

Data is modified by assigning to a position or an index label. A single value or several values can be changed at once.

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(5),index = list('abcde'))
print(s,'\n')
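s.iloc[1] = 100 # assign by position (s[1] on a string index is deprecated in recent pandas)
print(s,'\n')
s[['c','d']] = 200 # assign to several labels at once
print(s)

# Output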
a    0.317819
b    0.359241
c    0.662112
d    0.087609
e    0.940697
dtype: float64 

a      0.317819
b    100.000000
c      0.662112
d      0.087609
e      0.940697
dtype: float64 

a      0.317819
b    100.000000
c    200.000000
d    200.000000
e      0.940697
dtype: float64

5.4. 4.4 Viewing data

Similar to the Linux head and tail commands, s.head(n) and s.tail(n) return the first n and the last n items.

import numpy as np
import pandas as pd
s = pd.Series(np.random.rand(10))
print(s.head(2),'\n')
print(s.tail(3))

# Output
0    0.140628
1    0.768699
dtype: float64 

7    0.255628
8    0.535300
9    0.324614
dtype: float64

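5.5. 4.5 Reindexing

.reindex(new_labels, fill_value=...) returns a Series reordered to match the new labels. A label that is not in the original index is filled with NaN by default; the fill_value parameter sets the value used for such new labels instead.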

import numpy as np
import pandas as pd
s = pd.Series(np.random.rand(3),
             index = ['a','b','c'])
print(s, '\n')
s1 = s.reindex(['c','b','a','A'],fill_value = 100)
print(s1)

# Output
a    0.692466
b    0.757568
c    0.181863
dtype: float64 

c      0.181863
b      0.757568
a      0.692466
A    100.000000
dtype: float64

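5.6. 4.6 Data alignment

Data alignment means that operations between two Series, such as addition, are matched up by index label rather than by position. A label that appears in only one of the two Series produces NaN in the result.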

import numpy as np
import pandas as pd

s1 = pd.Series(np.random.rand(3),
             index = ['a','b','c'])
s2 = pd.Series(np.random.rand(3),
             index =['a','c','A'])
print(s1,'\n')
print(s2,'\n')
print(s1+s2)
# Output
a    0.414064
b    0.599441
c    0.579188
dtype: float64 

a    0.163382
c    0.095508
A    0.521609
dtype: float64 

A         NaN
a    0.577446
b         NaN
c    0.674696
dtype: float64
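
If NaN for unmatched labels is not wanted, the method form s1.add(s2, fill_value=...) treats a label missing from one side as the given value. A minimal sketch with made-up numbers:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])      # hypothetical data
s2 = pd.Series([10.0, 20.0, 30.0], index=['a', 'c', 'A'])   # hypothetical data

# The + operator yields NaN for labels present on only one side ('b' and 'A').
plain = s1 + s2

# .add() with fill_value=0 treats the missing side as 0 instead.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```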

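6. 5 Statistics

6.1. 5.1 Overview

Commonly used statistics functions are listed in the table below:

Function	Meaning
aggregate()	Aggregation, used to apply custom statistics functions (still to be investigated).
all()	Logical AND over all elements
any()	Logical OR over all elements
idxmin()	Index label of the minimum value
idxmax()	Index label of the maximum value
count()	Number of values, not counting missing ones (None/NaN).
cumsum()	Cumulative sum
cumprod()	Cumulative product
cov()	Covariance with another Series
corr()	Correlation coefficient with another Series
describe()	Descriptive statistics; returns several common statistics at once.
groupby()	Grouping
kurt()	Kurtosis
max()	Maximum
mean()	Mean
median()	Median
min()	Minimum
mode()	Mode
pct_change()	Fractional change of each element relative to the previous one
quantile()	Any quantile
size	Number of elements, including missing ones (an attribute, not a method)
skew()	Skewness
std()	Standard deviation
sum()	Sum
value_counts()	Frequency count: groups equal values and returns the number of items in each group.
var()	Variance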

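6.2. 5.2 Code demonstration

The code below demonstrates the main functions. Note: some calls failed in testing and are commented out pending further investigation.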

import numpy as np
import pandas as pd
data=[1,2,3,4,5,5,6,8,1,3,5,2,5,2]
s = pd.Series(data)
print(s)

#print('s.aggregate()', s.aggregate(3), '\n')   # fails: aggregate() expects a function or function name
print('s.all()', s.all(), '\n')
print('s.any()', s.any(), '\n')
print('s.idxmin()', s.idxmin(), '\n')
print('s.idxmax()', s.idxmax(), '\n')
print('s.count()', s.count(), '\n')
print('s.cumsum()', s.cumsum(), '\n')
print('s.cumprod()', s.cumprod(), '\n')
#print('s.cov()', s.cov(), '\n')                # fails: cov() requires another Series
#print('s.corr()', s.corr(), '\n')              # fails: corr() requires another Series
print('s.describe()', s.describe(), '\n')
#print('s.groupby()', s.groupby(5), '\n')       # fails: 5 is not a valid grouping key
print('s.kurt()', s.kurt(), '\n')
print('s.max()', s.max(), '\n')
print('s.mean()', s.mean(), '\n')
print('s.median()', s.median(), '\n')
print('s.min()', s.min(), '\n')
print('s.mode()', s.mode(), '\n')
print('s.pct_change()', s.pct_change(), '\n')
print('s.quantile()', s.quantile(), '\n')
#print('s.size()', s.size(), '\n')              # fails: size is an attribute, not a method
print('s.skew()', s.skew(), '\n')
print('s.std()', s.std(), '\n')
print('s.sum()', s.sum(), '\n')
print('s.value_counts()', s.value_counts(), '\n')
print('s.var()', s.var(), '\n')

# Output
0     1
1     2
2     3
3     4
4     5
5     5
6     6
7     8
8     1
9     3
10    5
11    2
12    5
13    2
dtype: int64
s.all() True 

s.any() True 

s.idxmin() 0 

s.idxmax() 7 

s.count() 14 

s.cumsum() 
0      1
1      3
2      6
3     10
4     15
5     20
6     26
7     34
8     35
9     38
10    43
11    45
12    50
13    52
dtype: int64 

s.cumprod() 
0           1
1           2
2           6
3          24
4         120
5         600
6        3600
7       28800
8       28800
9       86400
10     432000
11     864000
12    4320000
13    8640000
dtype: int64 

s.describe() 
count    14.000000
mean      3.714286
std       2.054210
min       1.000000
25%       2.000000
50%       3.500000
75%       5.000000
max       8.000000
dtype: float64 

s.kurt() -0.33190548058712066 

s.max() 8 

s.mean() 3.7142857142857144 

s.median() 3.5 

s.min() 1 

s.mode() 0    5
dtype: int64 

s.pct_change() 
0          NaN
1     1.000000
2     0.500000
3     0.333333
4     0.250000
5     0.000000
6     0.200000
7     0.333333
8    -0.875000
9     2.000000
10    0.666667
11   -0.600000
12    1.500000
13   -0.600000
dtype: float64 

s.quantile() 3.5 

s.skew() 0.4487734149006034 

s.std() 2.054210364052382 

s.sum() 52 

s.value_counts() 
5    4
2    3
3    2
1    2
8    1
6    1
4    1
dtype: int64 

s.var() 4.21978021978022
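
The calls that are commented out in the demonstration fail mainly because of how they are invoked. The sketch below shows one plausible way to call each of them on the same data; the second Series `other` and the choice of grouping by parity are assumptions made purely for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 5, 6, 8, 1, 3, 5, 2, 5, 2])
other = s * 2 + 1          # hypothetical second Series, needed by cov() and corr()

n = s.size                                  # size is an attribute, so no parentheses
summary = s.agg(['min', 'max', 'mean'])     # aggregate()/agg() takes function names or callables
cov = s.cov(other)                          # covariance between s and other
corr = s.corr(other)                        # correlation between s and other
by_parity = s.groupby(s % 2).sum()          # groupby() needs a key: here 0 for even, 1 for odd values

print(n)
print(summary)
print(cov, corr)
print(by_parity)
```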

7. 6 File I/O

7.1. 6.1 Saving

import numpy as np
import pandas as pd
data=[1,2,3,4,5,5,6,8,1,3,5,2,5,2]
s = pd.Series(data)
s.to_json(r'c:\data\s.json')

7.2. 6.2 Reading

Note that the older Series.from_csv(file) function has been removed; pd.read_csv(file) is recommended instead. Because read_csv returns a DataFrame, the Series still has to be extracted from it, as shown below:

import pandas as pd
s = pd.Series(list("ThisisamapofChina."), name='s0')
s.to_csv("s.csv") 
# s1 holds the same values as s
s1=pd.read_csv("s.csv")['s0']
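
The round trip can be checked without touching the disk. The sketch below is an illustration under two assumptions of mine: it writes to an in-memory io.StringIO buffer instead of a file, and it passes index_col=0 so that the saved index is restored as the index rather than read back as an extra column.

```python
import io
import pandas as pd

s = pd.Series(list("abc"), name='s0')   # hypothetical short Series

buf = io.StringIO()                     # in-memory stand-in for a CSV file
s.to_csv(buf, header=True)              # writes the index column and the 's0' header
buf.seek(0)

df = pd.read_csv(buf, index_col=0)      # first column becomes the index again
s1 = df['s0']                           # extract the Series from the DataFrame

print(s1)
```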

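8. 7 Sorting and grouping

8.1. 7.1 Sorting

Sorting uses the sort_values() function. The most common parameters are ascending, which selects ascending order (the default) or descending order, and inplace, which modifies the original data directly. See the example below.

import numpy as np
import pandas as pd
size=100
data = pd.Series(np.random.normal(100000, 10000, size*3))
data.sort_values(ascending=False, inplace=True)
print(data)

Output:
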
110    133736.289041
100    125750.005223
56     124638.880941
171    123683.549110
2      121613.786299
           ...
144     75828.751537
67      75643.456069
76      75477.452670
12      70625.992856
130     70300.131909
Length: 300, dtype: float64

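8.2. 7.2 Grouping

The groupby function splits a Series object into several groups according to a condition. The purpose of grouping is to divide one Series into n Series objects based on either the index or the values, so there are two kinds of grouping: by index and by value. Both are implemented below.

import pandas as pd
data = pd.Series([11, 11, 12])   # sample data matching the output below

# Group by index label: the lambda receives each index label
print('x'.center(60, '-'))
for item in data.groupby(lambda x: x):
    print('len=', len(item[1]), 'index=', item[0])
    print('values:\n', item[1])
    #print(type(item), '\n', item)

# Group by value: look up the value that belongs to each index label
print('data[x]'.center(60, '-'))
for item in data.groupby(lambda x: data[x]):
    print('len=', len(item[1]), 'value=', item[0])
    print('values:\n', item[1])
    #print(type(item), '\n', item)

Output: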

-----------------------------x------------------------------
len= 1 index= 0
values:
 0    11
dtype: int64
len= 1 index= 1
values:
 1    11
dtype: int64
len= 1 index= 2
values:
 2    12
dtype: int64
--------------------------data[x]---------------------------
len= 2 value= 11
values:
 0    11
1    11
dtype: int64
len= 1 value= 12
values:
 2    12
dtype: int64

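As shown above, a lambda expression can be passed to groupby to group the data. Note that the lambda's parameter x is the index label. Returning x therefore groups by index, while returning data[x] groups by value.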
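
Grouping by value does not strictly need a lambda. As a hedged alternative to the approach above, the Series itself can be passed as the grouping key, and value_counts() gives the group sizes directly. The sample data mirrors the output shown earlier.

```python
import pandas as pd

data = pd.Series([11, 11, 12])

# Passing the Series as its own key groups rows that share the same value.
sizes = data.groupby(data).size()

# Each iteration yields the group key and the sub-Series for that key.
for value, group in data.groupby(data):
    print('value=', value, 'index=', list(group.index))

# value_counts() reports the same sizes, ordered by frequency.
counts = data.value_counts()
print(sizes)
print(counts)
```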

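9. 8 Notes

Adding a missing value (None/NaN) to any value returns a missing value.
Functions such as count do not count missing values (None/NaN). (More to be added……)
9.1. Value counts and output by index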

import numpy as np
import pandas as pd
# Generate data
data = pd.Series(np.random.normal(100000, 10000, 100))

# Sort the data in place
data.sort_values(ascending=False, inplace=True)

# Count the occurrences of each value; vc is a pandas.core.series.Series
vc=data.value_counts()

# Q: How to output entries by index
for id in vc.index[:10]:
    print(id, vc[id])

# Q: How to iterate over and print all values
for value in vc:
    print(value)

# Q: How to group the data proportionally?
#result = data.aggregate(func = max())

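9.2. Querying quantile positions

The function quantile(q) returns the value at the given quantile of the data, where q is a fraction between 0 and 1. For example, quantile(0.1) is the value below which 10% of the data falls. The default is 0.5, i.e. the median.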

print('quantile:')
for i in range(11):
    print('qr', int(data.quantile(0.1 * i)))

This prints the values at the 0%, 10%, ..., 100% quantiles.
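
To make the behaviour of quantile() concrete, here is a small worked sketch on made-up data. With the default linear interpolation, the result is a data value or a value interpolated between two neighbouring data values, not an index label.

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40, 50])   # hypothetical data

median = data.quantile()                  # default q=0.5
q25 = data.quantile(0.25)                 # position (5-1)*0.25 = 1, i.e. the value 20
several = data.quantile([0.1, 0.5, 0.9])  # a list of q returns a Series indexed by q

print(median, q25)
print(several)
```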

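9.3. Data type conversion

Use the astype function. The conversion cannot modify the original data in place; it always produces a new object, so assign the result back: data = data.astype(int)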
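
A minimal sketch of this behaviour, using made-up values: the original Series keeps its dtype, and converting floats to int truncates the fractional part.

```python
import pandas as pd

s = pd.Series([1.7, 2.2, 3.9])   # hypothetical float data

as_int = s.astype(int)           # returns a new Series; fractions are truncated

print(s.dtype)                   # the original is still float64
print(as_int)
```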

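10. 9 Examples

10.1. 9.1 Grouping an input sequence into segments of 10

import numpy as np
import pandas as pd
# Generate random data
data = pd.Series(np.random.randint(0, 100, 30))

# Group by value. Note: this code, and the output below, use segments of width 9 (data[x]//9);
# use data[x]//10 for segments of width 10.
for item in data.groupby(lambda x: data[x]//9):
    print('len=', len(item[1]), 'value=', item[0])
    print('values:\n', item[1])

Output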

len= 3 value= 0
values:
 8     1
27    8
29    8
dtype: int32
len= 1 value= 1
values:
 9    17
dtype: int32
len= 2 value= 2
values:
 2     21
28    22
dtype: int32
len= 2 value= 3
values:
 7     32
15    34
dtype: int32
len= 6 value= 4
values:
 0     44
6     38
10    41
21    39
23    44
25    42
dtype: int32
len= 2 value= 5
values:
 22    48
26    53
dtype: int32
len= 1 value= 6
values:
 17    62
dtype: int32
len= 2 value= 7
values:
 3     70
13    67
dtype: int32
len= 6 value= 8
values:
 4     76
5     80
11    78
18    77
19    75
24    72
dtype: int32
len= 2 value= 9
values:
 14    81
20    82
dtype: int32
len= 3 value= 10
values:
 1     98
12    95
16    97
dtype: int32
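
For segments of width 10, as the section title describes, two equivalent sketches are shown below: integer division by 10 as the grouping key, and pd.cut() with explicit bin edges. The seeded random generator and the variable names are assumptions chosen only to make the example reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                    # seeded for reproducibility
data = pd.Series(rng.integers(0, 100, 30))        # 30 random integers in [0, 100)

# Approach 1: the segment number is the value divided by 10, rounded down.
segment = data // 10
counts = data.groupby(segment).size()

# Approach 2: pd.cut with edges 0, 10, ..., 100; right=False makes bins [0, 10), [10, 20), ...
labels = pd.cut(data, bins=list(range(0, 101, 10)), right=False)

print(counts)
print(labels.value_counts(sort=False))
```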

10.2. 9.2 Grouping domain names and IP addresses

Given a Series object that contains a large number of IP addresses and domain names, the example below groups them with a regular expression.

import re
import pandas as pd
ip_pattern=r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}"
data = pd.Series([
    'baidu.com',
    '192.168.5.30',
    'www.microsoft.com',
    '10.10.25.43',
    'www.sohu.com',
    'www.google.com'
])

grouped = data.groupby(lambda id: 'ip' if re.match(ip_pattern, data[id]) else 'domain')

for item in grouped:
    print('********* type: ', item[0], "********")
    for id in item[1].index:
        print('{0}\t{1}'.format(id, item[1][id]))

Output

********* type:  domain ********
0       baidu.com
2       www.microsoft.com
4       www.sohu.com
5       www.google.com
********* type:  ip ********
1       192.168.5.30
3       10.10.25.43
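
The same grouping can also be sketched with the vectorized string method Series.str.match(), which avoids calling re.match() once per index label. The trailing $ in the pattern is my addition so that the whole string must look like an IP address; like the original pattern, it does not check that each number is at most 255.

```python
import pandas as pd

ip_pattern = r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}$"   # original pattern plus an end anchor (assumption)
data = pd.Series([
    'baidu.com',
    '192.168.5.30',
    'www.microsoft.com',
    '10.10.25.43',
    'www.sohu.com',
    'www.google.com',
])

# True where the string matches the pattern from its start.
is_ip = data.str.match(ip_pattern)

# Turn the boolean mask into group names and group by them.
kind = is_ip.map({True: 'ip', False: 'domain'})
for name, group in data.groupby(kind):
    print(name, list(group))
```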

11. References

[1] pandas time series method pd.date_range(), https://blog.csdn.net/missyougoon/article/details/83958749
[2] pd.Series usage, https://www.cnblogs.com/sparkingplug/p/11409365.html
[3] Pandas time series: generating dates in a given range, https://blog.csdn.net/bqw18744018044/article/details/80920356