pandas进阶
pandas是基于numpy构建的库,在数据处理方面可以把它理解为numpy的加强版,由于numpy主要用于科学计算,特长不在于数据处理,我们平常处理的数据一般带有列标签和index索引,这时pandas作为数据分析包而被开发出来。
pandas数据结构(Series/DataFrame)
SRE实战 互联网时代守护先锋,助力企业售后服务体系运筹帷幄!一键直达领取阿里云限量特价优惠。一、Series
1、Series创建
Series类似一维数组的数据结构,由一组数据(各种numpy数据类型)和与之关联的数据标签(索引)组成,结构相当于定长有序的字典,index和value之间相互独立.
In [2]:import pandas as pd import numpy as npIn [3]:
# 创建Series a1 = pd.Series([1, 2, 3]) # 数组生成Series a1Out[3]:
0 1 1 2 2 3 dtype: int64In [4]:
a2 = pd.Series(np.array([1, 2, 3])) # numpy数组生成Series a2Out[4]:
0 1 1 2 2 3 dtype: int32In [5]:
a3 = pd.Series([1, 2, 3], index=["index1", "index2", "index3"]) # 指定标签index生成 a3Out[5]:
index1 1 index2 2 index3 3 dtype: int64In [6]:
a4 = pd.Series({"index1": 1, "index2": 2, "index3": 3}) # 字典生成Series a4Out[6]:
index1 1 index2 2 index3 3 dtype: int64In [8]:
a5 = pd.Series({"index": 1, "index2": 2, "index3": 3}, index=["index1", "index2", "index3"]) # 字典生成Series,指定index,不匹配部分为NaN a5Out[8]:
index1 NaN index2 2.0 index3 3.0 dtype: float64In [9]:
a6 = pd.Series(10, index=["index1", "index2", "index3"]) a6Out[9]:
index1 10 index2 10 index3 10 dtype: int64
2、Series属性
可以把Series看成一个定长的有序字典
可以通过shape(维度),size(长度),index(键),values(值)等得到series的属性
In [10]:a1 = pd.Series([1, 2, 3]) a1.index # Series索引Out[10]:
RangeIndex(start=0, stop=3, step=1)In [12]:
a1.values # Series数值Out[12]:
array([1, 2, 3], dtype=int64)In [13]:
a1.name = "population" # 指定Series名字 a1.index.name = "state" # 指定Series索引名字 a1Out[13]:
state 0 1 1 2 2 3 Name: population, dtype: int64In [14]:
a1.shapeOut[14]:
(3,)In [15]:
a1.sizeOut[15]:
3
3、Series查找元素
loc为显示切片(通过键),iloc为隐式切片(通过索引)
访问单个元素
s[indexname]
s.loc[indexname] 推荐
s[loc]
s.iloc[loc] 推荐<
访问多个元素
s[[indexname1,indexname2]]
s.loc[[indexname1,indexname2]] 推荐
s[[loc1,loc2]]
s.iloc[[loc1,loc2]] 推荐
a3 = pd.Series([1, 2, 3], index=["index1", "index2", "index3"]) a3Out[17]:
index1 1 index2 2 index3 3 dtype: int64In [18]:
a3["index1"]Out[18]:
1In [19]:
a3.loc['index1']Out[19]:
1In [20]:
a3[1]Out[20]:
2In [22]:
a3.iloc[1]Out[22]:
2In [23]:
a3[['index1','index2']]Out[23]:
index1 1 index2 2 dtype: int64In [24]:
a3.loc[['index1','index2']]Out[24]:
index1 1 index2 2 dtype: int64In [25]:
a3[[1,2]]Out[25]:
index2 2 index3 3 dtype: int64In [26]:
a3.iloc[[1,2]]Out[26]:
index2 2 index3 3 dtype: int64In [27]:
a3[a3 > np.mean(a3)] # 布尔值查找元素Out[27]:
index3 3 dtype: int64In [28]:
a3[0:2] # 绝对位置切片Out[28]:
index1 1 index2 2 dtype: int64In [30]:
a3["index1":"index2"] # 索引切片Out[30]:
index1 1 index2 2 dtype: int64
4、Series修改元素
In [32]:
# 修改元素 a3["index3"] = 100 # 按照索引修改元素 a3Out[32]:
index1 1 index2 2 index3 100 dtype: int64In [33]:
a3[2] = 1000 # 按照绝对位置修改元素 a3Out[33]:
index1 1 index2 2 index3 1000 dtype: int64
5、Series添加元素
In [34]:
# 添加元素 a3["index4"] = 10 # 按照索引添加元素 a3Out[34]:
index1 1 index2 2 index3 1000 index4 10 dtype: int64
6、Series删除元素
In [35]:
a3.drop(["index4", "index3"], inplace=True) # inplace=True表示作用在当前Series a3Out[35]:
index1 1 index2 2 dtype: int64
7、Series方法
In [36]:
a3 = pd.Series([1, 2, 3], index=["index1", "index2", "index3"]) a3["index3"] = np.NaN # 添加元素 a3Out[36]:
index1 1.0 index2 2.0 index3 NaN dtype: float64In [37]:
a3.isnull() # 判断Series是否有缺失值Out[37]:
index1 False index2 False index3 True dtype: boolIn [38]:
a3.notnull() # 判断Series是否没有缺失值Out[38]:
index1 True index2 True index3 False dtype: boolIn [39]:
"index1" in a3 # 判断Series中某个索引是否存在Out[39]:
TrueIn [47]:
a3.isin([1,2]) # 判断Series中某个值是否存在Out[47]:
index1 True index2 True index3 False dtype: boolIn [48]:
a3.unique() # 统计Series中去重元素Out[48]:
array([ 1., 2., nan])In [49]:
a3.value_counts() # 统计Series中去重元素和个数Out[49]:
2.0 1 1.0 1 dtype: int64
二、Dataframe
DataFrame是一个【表格型】的数据结构,可以看做是【由Series组成的字典】(共用同一个索引)。DataFrame由按一定顺序排列的多列数据组成。设计初衷是将Series的使用场景从一维拓展到多维。DataFrame既有行索引,也有列索引。
行索引:index
列索引:columns
值:values(numpy的二维数组)
1、创建DataFrame
1.1通过字典创建
In [50]:data = {"color": ["green", "red", "blue", "black", "yellow"], "price": [1, 2, 3, 4, 5]} dataFrame1 = pd.DataFrame(data=data) # 通过字典创建 dataFrame1Out[50]:
color | price | |
---|---|---|
0 | green | 1 |
1 | red | 2 |
2 | blue | 3 |
3 | black | 4 |
4 | yellow | 5 |
dataFrame2 = pd.DataFrame(data=data, index=["index1", "index2", "index3", "index4", "index5"]) dataFrame2Out[51]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
index3 | blue | 3 |
index4 | black | 4 |
index5 | yellow | 5 |
dataFrame3 = pd.DataFrame(data=data, index=["index1", "index2", "index3", "index4", "index5"], columns=["price"]) # 指定列索引 dataFrame3Out[52]:
price | |
---|---|
index1 | 1 |
index2 | 2 |
index3 | 3 |
index4 | 4 |
index5 | 5 |
dataFrame4 = pd.DataFrame(data=np.arange(12).reshape(3, 4)) # 通过numpy数组创建 dataFrame4Out[53]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
dic = { '张三':[150,150,150,300], '李四':[0,0,0,0] } pd.DataFrame(data=dic,index=['语文','数学','英语','理综'])Out[54]:
张三 | 李四 | |
---|---|---|
语文 | 150 | 0 |
数学 | 150 | 0 |
英语 | 150 | 0 |
理综 | 300 | 0 |
data = [[0,150],[0,150],[0,150],[0,300]] index = ['语文','数学','英语','理综'] columns = ['李四','张三'] pd.DataFrame(data=data,index=index,columns=columns)Out[56]:
李四 | 张三 | |
---|---|---|
语文 | 0 | 150 |
数学 | 0 | 150 |
英语 | 0 | 150 |
理综 | 0 | 300 |
1.2通过Series创建
In [59]:cars = pd.Series({"Beijing": 300000, "Shanghai": 350000, "Shenzhen": 300000, "Tianjian": 200000, "Guangzhou": 250000, "Chongqing": 150000}) carsOut[59]:
Beijing 300000 Shanghai 350000 Shenzhen 300000 Tianjian 200000 Guangzhou 250000 Chongqing 150000 dtype: int64In [60]:
cities = {"Shanghai": 90000, "Foshan": 4500, "Dongguan": 5500, "Beijing": 6600, "Nanjing": 8000, "Lanzhou": None} apts = pd.Series(cities, name="price") aptsOut[60]:
Shanghai 90000.0 Foshan 4500.0 Dongguan 5500.0 Beijing 6600.0 Nanjing 8000.0 Lanzhou NaN Name: price, dtype: float64In [61]:
df = pd.DataFrame({"apts": apts, "cars": cars}) dfOut[61]:
apts | cars | |
---|---|---|
Beijing | 6600.0 | 300000.0 |
Chongqing | NaN | 150000.0 |
Dongguan | 5500.0 | NaN |
Foshan | 4500.0 | NaN |
Guangzhou | NaN | 250000.0 |
Lanzhou | NaN | NaN |
Nanjing | 8000.0 | NaN |
Shanghai | 90000.0 | 350000.0 |
Shenzhen | NaN | 300000.0 |
Tianjian | NaN | 200000.0 |
1.3通过dicts的list来构建Dataframe
In [62]:data = [{"Beijing": 1000, "Shanghai": 2500, "Nanjing": 9850}, {"Beijing": 5000, "Shanghai": 4600, "Nanjing": 7000}] pd.DataFrame(data)Out[62]:
Beijing | Nanjing | Shanghai | |
---|---|---|---|
0 | 1000 | 9850 | 2500 |
1 | 5000 | 7000 | 4600 |
2、查找DataFrame中的元素
In [65]:
data = {"color": ["green", "red", "blue", "black", "yellow"], "price": [1, 2, 3, 4, 5]} dataFrame2 = pd.DataFrame(data=data, index=["index1", "index2", "index3", "index4", "index5"]) dataFrame2Out[65]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
index3 | blue | 3 |
index4 | black | 4 |
index5 | yellow | 5 |
dataFrame2.columns # 查找dataFrame中所有列标签Out[66]:
Index(['color', 'price'], dtype='object')In [67]:
dataFrame2.index # 查找dataFrame中的所有行标签Out[67]:
Index(['index1', 'index2', 'index3', 'index4', 'index5'], dtype='object')In [68]:
dataFrame2.values # 查找dataFrame中的所有值Out[68]:
array([['green', 1], ['red', 2], ['blue', 3], ['black', 4], ['yellow', 5]], dtype=object)In [72]:
dataFrame2["color"]["index1"] # 索引查找数值(先列后行,否则报错)Out[72]:
'green'In [73]:
dataFrame2.at["index1", "color"] # 索引查找数值(先行后列,否则报错)Out[73]:
'green'In [79]:
dataFrame2.iat[0, 1] # 绝对位置查找数值Out[79]:
1
3、查找DataFrame中某一行/列元素
In [89]:
data = {"color": ["green", "red", "blue", "black", "yellow"], "price": [1, 2, 3, 4, 5]} dataFrame2 = pd.DataFrame(data=data, index=["index1", "index2", "index3", "index4", "index5"]) dataFrame2Out[89]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
index3 | blue | 3 |
index4 | black | 4 |
index5 | yellow | 5 |
dataFrame2.loc["index1"] # 查找一行元素Out[91]:
color green price 1 Name: index1, dtype: objectIn [92]:
dataFrame2.iloc[0] # 查找一行元素(绝对位置)Out[92]:
color green price 1 Name: index1, dtype: objectIn [96]:
dataFrame2.iloc[0:2] # 通过iloc方法可以拿到行和列,直接按照index的顺序来取。# 可以当做numpy的ndarray的二维数组来操作。Out[96]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
dataFrame2.loc[:, "price"] # 查找一列元素Out[100]:
index1 1 index2 2 index3 3 index4 4 index5 5 Name: price, dtype: int64In [101]:
dataFrame2.iloc[:, 0] # 查找一列元素(绝对位置)Out[101]:
index1 green index2 red index3 blue index4 black index5 yellow Name: color, dtype: objectIn [102]:
dataFrame2.values[0] # 查找一行元素Out[102]:
array(['green', 1], dtype=object)In [103]:
dataFrame2["price"] # 查找一列元素,#通过列名的方式,查找列,不能查找行Out[103]:
index1 1 index2 2 index3 3 index4 4 index5 5 Name: price, dtype: int64In [104]:
dataFrame2["color"]Out[104]:
index1 green index2 red index3 blue index4 black index5 yellow Name: color, dtype: object
4、查找DataFrame中的多行/列元素
In [106]:
dataFrame2.head(5) # 查看前5行元素Out[106]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
index3 | blue | 3 |
index4 | black | 4 |
index5 | yellow | 5 |
dataFrame2.tail(5) # 查看后5行元素Out[107]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
index3 | blue | 3 |
index4 | black | 4 |
index5 | yellow | 5 |
dataFrame2["index1":"index4"] # 切片多行Out[108]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
index3 | blue | 3 |
index4 | black | 4 |
dataFrame2[0:4] # 切片多行Out[109]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
index3 | blue | 3 |
index4 | black | 4 |
dataFrame2.loc[["index1", "index2"]] # 多行Out[111]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
dataFrame2.iloc[[0, 1]] # 多行Out[113]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
dataFrame2.loc[:, ["price"]] # 多列Out[114]:
price | |
---|---|
index1 | 1 |
index2 | 2 |
index3 | 3 |
index4 | 4 |
index5 | 5 |
dataFrame2.iloc[:, [0, 1]] # 多列Out[115]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
index3 | blue | 3 |
index4 | black | 4 |
index5 | yellow | 5 |
dataFrame2.loc[["index1", "index3"], ["price"]] # 索引查找Out[116]:
price | |
---|---|
index1 | 1 |
index3 | 3 |
dataFrame2.iloc[[1, 2], [0]] # 绝对位置查找Out[117]:
color | |
---|---|
index2 | red |
index3 | blue |
5、添加一行/列元素
In [119]:
dataFrame2.loc["index6"] = ["pink", 3] dataFrame2Out[119]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
index3 | blue | 3 |
index4 | black | 4 |
index5 | yellow | 5 |
index6 | pink | 3 |
dataFrame2.loc["index6"]=10 dataFrame2Out[120]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
index3 | blue | 3 |
index4 | black | 4 |
index5 | yellow | 5 |
index6 | 10 | 10 |
dataFrame2.iloc[5] = 10 dataFrame2Out[123]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
index3 | blue | 3 |
index4 | black | 4 |
index5 | yellow | 5 |
index6 | 10 | 10 |
dataFrame2.loc["index7"] = 100 dataFrame2Out[125]:
color | price | |
---|---|---|
index1 | green | 1 |
index2 | red | 2 |
index3 | blue | 3 |
index4 | black | 4 |
index5 | yellow | 5 |
index6 | 10 | 10 |
index7 | 100 | 100 |
dataFrame2.loc[:, "size"] = "small" dataFrame2Out[129]:
color | price | size | |
---|---|---|---|
index1 | green | 1 | small |
index2 | red | 2 | small |
index3 | blue | 3 | small |
index4 | black | 4 | small |
index5 | yellow | 5 | small |
index6 | 10 | 10 | small |
index7 | 100 | 100 | small |
dataFrame2.iloc[:, 2] = 10 dataFrame2Out[130]:
color | price | size | |
---|---|---|---|
index1 | green | 1 | 10 |
index2 | red | 2 | 10 |
index3 | blue | 3 | 10 |
index4 | black | 4 | 10 |
index5 | yellow | 5 | 10 |
index6 | 10 | 10 | 10 |
index7 | 100 | 100 | 10 |
6、修改元素
In [131]:
dataFrame2.loc["index1", "price"] = 100 dataFrame2Out[131]:
color | price | size | |
---|---|---|---|
index1 | green | 100 | 10 |
index2 | red | 2 | 10 |
index3 | blue | 3 | 10 |
index4 | black | 4 | 10 |
index5 | yellow | 5 | 10 |
index6 | 10 | 10 | 10 |
index7 | 100 | 100 | 10 |
dataFrame2.iloc[0, 1] = 10 dataFrame2Out[132]:
color | price | size | |
---|---|---|---|
index1 | green | 10 | 10 |
index2 | red | 2 | 10 |
index3 | blue | 3 | 10 |
index4 | black | 4 | 10 |
index5 | yellow | 5 | 10 |
index6 | 10 | 10 | 10 |
index7 | 100 | 100 | 10 |
dataFrame2.at["index1", "price"] = 100 dataFrame2Out[133]:
color | price | size | |
---|---|---|---|
index1 | green | 100 | 10 |
index2 | red | 2 | 10 |
index3 | blue | 3 | 10 |
index4 | black | 4 | 10 |
index5 | yellow | 5 | 10 |
index6 | 10 | 10 | 10 |
index7 | 100 | 100 | 10 |
dataFrame2.iat[0, 1] = 1000 dataFrame2Out[135]:
color | price | size | |
---|---|---|---|
index1 | green | 1000 | 10 |
index2 | red | 2 | 10 |
index3 | blue | 3 | 10 |
index4 | black | 4 | 10 |
index5 | yellow | 5 | 10 |
index6 | 10 | 10 | 10 |
index7 | 100 | 100 | 10 |
7、删除元素
In [136]:
dataFrame2.drop(["index6", "index7"], inplace=True) # inplace=True表示作用在原数组 dataFrame2Out[136]:
color | price | size | |
---|---|---|---|
index1 | green | 1000 | 10 |
index2 | red | 2 | 10 |
index3 | blue | 3 | 10 |
index4 | black | 4 | 10 |
index5 | yellow | 5 | 10 |
a=dataFrame2.drop(["price"], axis=1, inplace=False) dataFrame2Out[141]:
color | price | |
---|---|---|
index1 | green | 1000 |
index2 | red | 2 |
index3 | blue | 3 |
index4 | black | 4 |
index5 | yellow | 5 |
a
Out[142]:
color | |
---|---|
index1 | green |
index2 | red |
index3 | blue |
index4 | black |
index5 | yellow |
8. 处理NaN数据
In [148]:
dates = pd.date_range('20180101', periods=3) df = pd.DataFrame(np.arange(12).reshape((3, 4)), index=dates, columns=['a', 'b', 'c', 'd']) df.iloc[1, 1], df.iloc[2, 2] = np.nan, np.nan dfOut[148]:
a | b | c | d | |
---|---|---|---|---|
2018-01-01 | 0 | 1.0 | 2.0 | 3 |
2018-01-02 | 4 | NaN | 6.0 | 7 |
2018-01-03 | 8 | 9.0 | NaN | 11 |
8.1删除NaN数据
In [151]:re=df.dropna(axis=1, inplace=False) # inplace默认为false dfOut[151]:
a | b | c | d | |
---|---|---|---|---|
2018-01-01 | 0 | 1.0 | 2.0 | 3 |
2018-01-02 | 4 | NaN | 6.0 | 7 |
2018-01-03 | 8 | 9.0 | NaN | 11 |
re
Out[152]:
a | d | |
---|---|---|
2018-01-01 | 0 | 3 |
2018-01-02 | 4 | 7 |
2018-01-03 | 8 | 11 |
8.2填充NaN数据
In [153]:
re2 = df.fillna(value='*') re2Out[153]:
a | b | c | d | |
---|---|---|---|---|
2018-01-01 | 0 | 1 | 2 | 3 |
2018-01-02 | 4 | * | 6 | 7 |
2018-01-03 | 8 | 9 | * | 11 |
8.3 检查是否存在NaN
In [155]:
df.isnull()Out[155]:
a | b | c | d | |
---|---|---|---|---|
2018-01-01 | False | False | False | False |
2018-01-02 | False | True | False | False |
2018-01-03 | False | False | True | False |
9.合并DataFrame
9.1 concat函数
In [156]:df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd']) df1Out[156]:
a | b | c | d | |
---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 |
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd']) df2Out[157]:
a | b | c | d | |
---|---|---|---|---|
0 | 1.0 | 1.0 | 1.0 | 1.0 |
1 | 1.0 | 1.0 | 1.0 | 1.0 |
2 | 1.0 | 1.0 | 1.0 | 1.0 |
df3 = pd.DataFrame(np.ones((3, 4)) * 2, columns=['a', 'b', 'c', 'd']) df3Out[158]:
a | b | c | d | |
---|---|---|---|---|
0 | 2.0 | 2.0 | 2.0 | 2.0 |
1 | 2.0 | 2.0 | 2.0 | 2.0 |
2 | 2.0 | 2.0 | 2.0 | 2.0 |
# ignore_index=True将重新对index排序 pd.concat([df1, df2, df3], axis=0, ignore_index=True)Out[159]:
a | b | c | d | |
---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 1.0 | 1.0 | 1.0 | 1.0 |
4 | 1.0 | 1.0 | 1.0 | 1.0 |
5 | 1.0 | 1.0 | 1.0 | 1.0 |
6 | 2.0 | 2.0 | 2.0 | 2.0 |
7 | 2.0 | 2.0 | 2.0 | 2.0 |
8 | 2.0 | 2.0 | 2.0 | 2.0 |
# ignore_index=True将重新对index排序 pd.concat([df1, df2, df3], axis=0, ignore_index=False)Out[160]:
a | b | c | d | |
---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 |
0 | 1.0 | 1.0 | 1.0 | 1.0 |
1 | 1.0 | 1.0 | 1.0 | 1.0 |
2 | 1.0 | 1.0 | 1.0 | 1.0 |
0 | 2.0 | 2.0 | 2.0 | 2.0 |
1 | 2.0 | 2.0 | 2.0 | 2.0 |
2 | 2.0 | 2.0 | 2.0 | 2.0 |
join参数用法
In [164]:df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3]) df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4]) # join默认为'outer',不共有的列用NaN填充 pd.concat([df1, df2], sort=False, join='outer')Out[164]:
a | b | c | d | e | |
---|---|---|---|---|---|
1 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
2 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
3 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
2 | NaN | 1.0 | 1.0 | 1.0 | 1.0 |
3 | NaN | 1.0 | 1.0 | 1.0 | 1.0 |
4 | NaN | 1.0 | 1.0 | 1.0 | 1.0 |
# join='inner'只合并共有的列 pd.concat([df1, df2], sort=False, join='inner',ignore_index=True)Out[166]:
b | c | d | |
---|---|---|---|
0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 |
3 | 1.0 | 1.0 | 1.0 |
4 | 1.0 | 1.0 | 1.0 |
5 | 1.0 | 1.0 | 1.0 |
join_axes参数用法
In [167]:# 按照df1的index进行合并 pd.concat([df1, df2], axis=1, join_axes=[df1.index])Out[167]:
a | b | c | d | b | c | d | e | |
---|---|---|---|---|---|---|---|---|
1 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
9.2 append函数¶
In [169]:df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd']) df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd']) re = df1.append(df2, ignore_index=True) reOut[169]:
a | b | c | d | |
---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 1.0 | 1.0 | 1.0 | 1.0 |
4 | 1.0 | 1.0 | 1.0 | 1.0 |
5 | 1.0 | 1.0 | 1.0 | 1.0 |
append一组数据
In [170]:df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd']) s = pd.Series([4, 4, 4, 4], index=['a', 'b', 'c', 'd']) re = df1.append(s, ignore_index=True) reOut[170]:
a | b | c | d | |
---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 4.0 | 4.0 | 4.0 | 4.0 |
9.3 merge函数
基于某一列进行合并
In [171]:df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'], 'B': ['B1', 'B2', 'B3'], 'KEY': ['K1', 'K2', 'K3']}) df2 = pd.DataFrame({'C': ['C1', 'C2', 'C3'], 'D': ['D1', 'D2', 'D3'], 'KEY': ['K1', 'K2', 'K3']}) df1Out[171]:
A | B | KEY | |
---|---|---|---|
0 | A1 | B1 | K1 |
1 | A2 | B2 | K2 |
2 | A3 | B3 | K3 |
df2
Out[172]:
C | D | KEY | |
---|---|---|---|
0 | C1 | D1 | K1 |
1 | C2 | D2 | K2 |
2 | C3 | D3 | K3 |
re = pd.merge(df1, df2, on='KEY') reOut[173]:
A | B | KEY | C | D | |
---|---|---|---|---|---|
0 | A1 | B1 | K1 | C1 | D1 |
1 | A2 | B2 | K2 | C2 | D2 |
2 | A3 | B3 | K3 | C3 | D3 |
基于某两列进行合并
In [175]:df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'], 'B': ['B1', 'B2', 'B3'], 'KEY1': ['K1', 'K2', 'K0'], 'KEY2': ['K0', 'K1', 'K3']}) df2 = pd.DataFrame({'C': ['C1', 'C2', 'C3'], 'D': ['D1', 'D2', 'D3'], 'KEY1': ['K0', 'K2', 'K1'], 'KEY2': ['K1', 'K1', 'K0']}) # how:['left','right','outer','inner'] re = pd.merge(df1, df2, on=['KEY1', 'KEY2'], how='inner') reOut[175]:
A | B | KEY1 | KEY2 | C | D | |
---|---|---|---|---|---|---|
0 | A1 | B1 | K1 | K0 | C3 | D3 |
1 | A2 | B2 | K2 | K1 | C2 | D2 |
按index合并
In [176]:df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'], 'B': ['B1', 'B2', 'B3']}, index=['K0', 'K1', 'K2']) df2 = pd.DataFrame({'C': ['C1', 'C2', 'C3'], 'D': ['D1', 'D2', 'D3']}, index=['K0', 'K1', 'K3']) re = pd.merge(df1, df2, left_index=True, right_index=True, how='outer') reOut[176]:
A | B | C | D | |
---|---|---|---|---|
K0 | A1 | B1 | C1 | D1 |
K1 | A2 | B2 | C2 | D2 |
K2 | A3 | B3 | NaN | NaN |
K3 | NaN | NaN | C3 | D3 |
为列加后缀
In [177]:df_boys = pd.DataFrame({'id': ['1', '2', '3'], 'age': ['23', '25', '18']}) df_girls = pd.DataFrame({'id': ['1', '2', '3'], 'age': ['18', '18', '18']}) re = pd.merge(df_boys, df_girls, on='id', suffixes=['_boys', '_girls']) reOut[177]:
id | age_boys | age_girls | |
---|---|---|---|
0 | 1 | 23 | 18 |
1 | 2 | 25 | 18 |
2 | 3 | 18 | 18 |
