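Common pandas functions in detail
The groupby function
pandas provides a flexible and efficient groupby facility that lets you slice, dice, and summarize a dataset in a natural way. You split a pandas object by one or more keys (functions, arrays, Series, or DataFrame column names) and then compute per-group summary statistics such as count, mean, and standard deviation, or apply a user-defined function.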
import numpy as np
import pandas as pd

# Sample data used throughout; note that 'kings' (lowercase) is a distinct value from 'Kings'.
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
1. Grouping by a Series
To group by Team and compute the mean of the Points column, select Points first and then call groupby with Team as the key:
grouped = df['Points'].groupby(df['Team'])
# Same grouping: df['Points'].groupby(df.Team) and df['Points'].groupby(df.Team.values)
print(grouped.groups)
print(grouped.mean())
Output:

{'Devils': Int64Index([2, 3], dtype='int64'), 'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64')}
Team
Devils    768.000000
Kings     761.666667
Riders    762.250000
Royals    752.500000
kings     812.000000
Name: Points, dtype: float64
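Note: the data (a Series) is aggregated according to the group key, producing a new Series whose index consists of the unique values in the Team column. Keys are matched exactly, which is why 'kings' and 'Kings' end up in separate groups.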
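The summary statistics mentioned in the introduction (count, mean, standard deviation, or a user-defined function) can be computed in one pass with `agg`. The following is a minimal sketch on the same data; the helper `points_range` is a hypothetical custom aggregation introduced only for illustration.

```python
import pandas as pd

ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015,
             2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)


def points_range(s):
    """Hypothetical custom aggregation: best score minus worst score in a group."""
    return s.max() - s.min()


# Passing a list to agg yields one column per aggregation, indexed by Team.
summary = df['Points'].groupby(df['Team']).agg(['count', 'mean', 'std', points_range])
print(summary)
```

The standard deviation for a single-row group such as 'kings' comes out as NaN, since a sample standard deviation is undefined for one observation.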
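2. Grouping by arrays. The group key can in fact be any array of suitable length (its length must equal the number of rows); it does not have to be a Series.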
# Group by two keys at once
grouped = df['Points'].groupby([df['Team'], df['Year']])
print(grouped.groups)

# Group by an arbitrary NumPy array with one entry per row
age = np.array([2, 3, 3, 4, 5, 6, 7, 8, 8, 8, 8, 13])
grouped = df['Points'].groupby(age)
print(grouped.groups)
Output:

{('Devils', 2014): Int64Index([2], dtype='int64'), ('Devils', 2015): Int64Index([3], dtype='int64'), ('Kings', 2014): Int64Index([4], dtype='int64'), ('Kings', 2016): Int64Index([6], dtype='int64'), ('Kings', 2017): Int64Index([7], dtype='int64'), ('Riders', 2014): Int64Index([0], dtype='int64'), ('Riders', 2015): Int64Index([1], dtype='int64'), ('Riders', 2016): Int64Index([8], dtype='int64'), ('Riders', 2017): Int64Index([11], dtype='int64'), ('Royals', 2014): Int64Index([9], dtype='int64'), ('Royals', 2015): Int64Index([10], dtype='int64'), ('kings', 2015): Int64Index([5], dtype='int64')}
{2: Int64Index([0], dtype='int64'), 3: Int64Index([1, 2], dtype='int64'), 4: Int64Index([3], dtype='int64'), 5: Int64Index([4], dtype='int64'), 6: Int64Index([5], dtype='int64'), 7: Int64Index([6], dtype='int64'), 8: Int64Index([7, 8, 9, 10], dtype='int64'), 13: Int64Index([11], dtype='int64')}
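When two keys are used, aggregating produces a Series with a two-level index, one level per key. As a rough sketch on the same data, the result can be pivoted into a Team-by-Year table with `unstack`; the variable names `means` and `wide` are illustrative only.

```python
import pandas as pd

ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015,
             2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)

# Mean of Points for every (Team, Year) pair that occurs in the data.
means = df['Points'].groupby([df['Team'], df['Year']]).mean()
print(means)

# Move the inner level (Year) into columns; missing pairs become NaN.
wide = means.unstack()
print(wide)
```

Combinations that never occur in the data, such as Devils in 2016, show up as NaN in the wide table.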
3. Using column names as group keys
grouped = df.groupby('Team')
print(grouped.groups)
print(grouped.mean())
Output:

{'Devils': Int64Index([2, 3], dtype='int64'), 'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64')}
            Rank         Year      Points
Team
Devils  2.500000  2014.500000  768.000000
Kings   1.666667  2015.666667  761.666667
Riders  1.750000  2015.500000  762.250000
Royals  2.500000  2014.500000  752.500000
kings   4.000000  2015.000000  812.000000
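Column names can also be passed as a list to group by several columns, and `size` reports how many rows fall into each group. The snippet below is a small sketch under the assumption that `df` is built from the same `ipl_data` as above.

```python
import pandas as pd

ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015,
             2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)

# Number of rows per team.
team_sizes = df.groupby('Team').size()
print(team_sizes)

# Grouping by two column names; the key columns move into the index,
# so only Rank and Points remain as data columns.
pair_means = df.groupby(['Team', 'Year']).mean()
print(pair_means)
```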
1. Iterating over a groupby result
grouped = df.groupby(['Team', 'Year'])
# With multiple keys, each group name is a tuple of key values
for (team, year), v in grouped:
    print("{0}--{1}".format(team, year))
    print(v)
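Because iteration yields (name, sub-DataFrame) pairs, the groups can be collected into a dictionary for later lookup. This is a minimal sketch using a single key; the name `pieces` is illustrative, and `get_group` is shown as an alternative way to fetch one group.

```python
import pandas as pd

ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015,
             2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Team')

# Each iteration step gives the group key and the matching rows.
pieces = {team: group for team, group in grouped}
print(pieces['Riders'])

# get_group retrieves a single group without iterating.
riders = grouped.get_group('Riders')
print(riders)
```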
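2. Selecting one or more columns
For a GroupBy object created from a DataFrame, indexing it with a column name (a single string) or a list of column names selects just those columns for aggregation.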
# The following two lines are equivalent
print(df.groupby('Team')['Points'].groups)
print(df['Points'].groupby(df['Team']).groups)
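The difference between indexing with a single string and with a list shows up in the type of the aggregated result. A short sketch on the same data (variable names are illustrative):

```python
import pandas as pd

ipl_data = {
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015,
             2016, 2017, 2016, 2014, 2015, 2017],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)

# A single string selects one column and aggregates to a Series.
as_series = df.groupby('Team')['Points'].mean()

# A list (even with one name) aggregates to a DataFrame.
as_frame = df.groupby('Team')[['Points']].mean()

# Several columns at once.
both = df.groupby('Team')[['Rank', 'Points']].mean()

print(type(as_series).__name__, type(as_frame).__name__)
print(both)
```

Selecting columns before aggregating also avoids computing statistics for columns that are not needed, such as Year here.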
