Pandas高級教程之Pandas中的GroupBy操作
簡介
pandas中的DF數據類型可以像數據庫表格一樣進行groupby操作。通常來說groupby操作可以分為三部分:分割數據,應用變換和和合並數據。
本文將會詳細講解Pandas中的groupby操作。
分割數據
分割數據的目的是將DF分割成為一個個的group。為瞭進行groupby操作,在創建DF的時候需要指定相應的label:
df = pd.DataFrame( ...: { ...: "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"], ...: "B": ["one", "one", "two", "three", "two", "two", "one", "three"], ...: "C": np.random.randn(8), ...: "D": np.random.randn(8), ...: } ...: ) ...: df Out[61]: A B C D 0 foo one -0.490565 -0.233106 1 bar one 0.430089 1.040789 2 foo two 0.653449 -1.155530 3 bar three -0.610380 -0.447735 4 foo two -0.934961 0.256358 5 bar two -0.256263 -0.661954 6 foo one -1.132186 -0.304330 7 foo three 2.129757 0.445744
默認情況下,groupby的軸是x軸。可以一列group,也可以多列group:
In [8]: grouped = df.groupby("A") In [9]: grouped = df.groupby(["A", "B"])
多index
在0.24版本中,如果我們有多index,可以從中選擇特定的index進行group:
In [10]: df2 = df.set_index(["A", "B"]) In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"])) In [12]: grouped.sum() Out[12]: C D A bar -1.591710 -1.739537 foo -0.752861 -1.402938
get_group
get_group 可以獲取分組之後的數據:
In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]}) In [25]: df3.groupby(["X"]).get_group("A") Out[25]: X Y 0 A 1 2 A 3 In [26]: df3.groupby(["X"]).get_group("B") Out[26]: X Y 1 B 4 3 B 2
dropna
默認情況下,NaN數據會被排除在groupby之外,通過設置 dropna=False 可以允許NaN數據:
In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]] In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"]) In [29]: df_dropna Out[29]: a b c 0 1 2.0 3 1 1 NaN 4 2 2 1.0 3 3 1 2.0 2
# Default ``dropna`` is set to True, which will exclude NaNs in keys In [30]: df_dropna.groupby(by=["b"], dropna=True).sum() Out[30]: a c b 1.0 2 3 2.0 2 5 # In order to allow NaN in keys, set ``dropna`` to False In [31]: df_dropna.groupby(by=["b"], dropna=False).sum() Out[31]: a c b 1.0 2 3 2.0 2 5 NaN 1 4
groups屬性
groupby對象有個groups屬性,它是一個key-value字典,key是用來分類的數據,value是分類對應的值。
In [34]: grouped = df.groupby(["A", "B"]) In [35]: grouped.groups Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]} In [36]: len(grouped) Out[36]: 6
index的層級
對於多級index對象,groupby可以指定group的index層級:
In [40]: arrays = [ ....: ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"], ....: ["one", "two", "one", "two", "one", "two", "one", "two"], ....: ] ....: In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"]) In [42]: s = pd.Series(np.random.randn(8), index=index) In [43]: s Out[43]: first second bar one -0.919854 two -0.042379 baz one 1.247642 two -0.009920 foo one 0.290213 two 0.495767 qux one 0.362949 two 1.548106 dtype: float64
group第一級:
In [44]: grouped = s.groupby(level=0) In [45]: grouped.sum() Out[45]: first bar -0.962232 baz 1.237723 foo 0.785980 qux 1.911055 dtype: float64
group第二級:
In [46]: s.groupby(level="second").sum() Out[46]: second one 0.980950 two 1.991575 dtype: float64
group的遍歷
得到group對象之後,我們可以通過for語句來遍歷group:
In [62]: grouped = df.groupby('A') In [63]: for name, group in grouped: ....: print(name) ....: print(group) ....: bar A B C D 1 bar one 0.254161 1.511763 3 bar three 0.215897 -0.990582 5 bar two -0.077118 1.211526 foo A B C D 0 foo one -0.575247 1.346061 2 foo two -1.143704 1.627081 4 foo two 1.193555 -0.441652 6 foo one -0.408530 0.268520 7 foo three -0.862495 0.024580
如果是多字段group,group的名字是一個元組:
In [64]: for name, group in df.groupby(['A', 'B']): ....: print(name) ....: print(group) ....: ('bar', 'one') A B C D 1 bar one 0.254161 1.511763 ('bar', 'three') A B C D 3 bar three 0.215897 -0.990582 ('bar', 'two') A B C D 5 bar two -0.077118 1.211526 ('foo', 'one') A B C D 0 foo one -0.575247 1.346061 6 foo one -0.408530 0.268520 ('foo', 'three') A B C D 7 foo three -0.862495 0.02458 ('foo', 'two') A B C D 2 foo two -1.143704 1.627081 4 foo two 1.193555 -0.441652
聚合操作
分組之後,就可以進行聚合操作:
In [67]: grouped = df.groupby("A") In [68]: grouped.aggregate(np.sum) Out[68]: C D A bar 0.392940 1.732707 foo -1.796421 2.824590 In [69]: grouped = df.groupby(["A", "B"]) In [70]: grouped.aggregate(np.sum) Out[70]: C D A B bar one 0.254161 1.511763 three 0.215897 -0.990582 two -0.077118 1.211526 foo one -0.983776 1.614581 three -0.862495 0.024580 two 0.049851 1.185429
對於多index數據來說,默認返回值也是多index的。如果想使用新的index,可以添加 as_index = False:
In [71]: grouped = df.groupby(["A", "B"], as_index=False) In [72]: grouped.aggregate(np.sum) Out[72]: A B C D 0 bar one 0.254161 1.511763 1 bar three 0.215897 -0.990582 2 bar two -0.077118 1.211526 3 foo one -0.983776 1.614581 4 foo three -0.862495 0.024580 5 foo two 0.049851 1.185429 In [73]: df.groupby("A", as_index=False).sum() Out[73]: A C D 0 bar 0.392940 1.732707 1 foo -1.796421 2.824590
上面的效果等同於reset_index
In [74]: df.groupby(["A", "B"]).sum().reset_index()
grouped.size() 計算group的大小:
In [75]: grouped.size() Out[75]: A B size 0 bar one 1 1 bar three 1 2 bar two 1 3 foo one 2 4 foo three 1 5 foo two 2
grouped.describe() 描述group的信息:
In [76]: grouped.describe() Out[76]: C ... D count mean std min 25% 50% ... std min 25% 50% 75% max 0 1.0 0.254161 NaN 0.254161 0.254161 0.254161 ... NaN 1.511763 1.511763 1.511763 1.511763 1.511763 1 1.0 0.215897 NaN 0.215897 0.215897 0.215897 ... NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582 2 1.0 -0.077118 NaN -0.077118 -0.077118 -0.077118 ... NaN 1.211526 1.211526 1.211526 1.211526 1.211526 3 2.0 -0.491888 0.117887 -0.575247 -0.533567 -0.491888 ... 0.761937 0.268520 0.537905 0.807291 1.076676 1.346061 4 1.0 -0.862495 NaN -0.862495 -0.862495 -0.862495 ... NaN 0.024580 0.024580 0.024580 0.024580 0.024580 5 2.0 0.024925 1.652692 -1.143704 -0.559389 0.024925 ... 1.462816 -0.441652 0.075531 0.592714 1.109898 1.627081 [6 rows x 16 columns]
通用聚合方法
下面是通用的聚合方法:
函數 | 描述 |
---|---|
mean() |
平均值 |
sum() |
求和 |
size() |
計算size |
count() |
group的統計 |
std() |
標準差 |
var() |
方差 |
sem() |
均值的標準誤 |
describe() |
統計信息描述 |
first() |
第一個group值 |
last() |
最後一個group值 |
nth() |
第n個group值 |
min() |
最小值 |
max() |
最大值 |
可以同時指定多個聚合方法:
In [81]: grouped = df.groupby("A") In [82]: grouped["C"].agg([np.sum, np.mean, np.std]) Out[82]: sum mean std A bar 0.392940 0.130980 0.181231 foo -1.796421 -0.359284 0.912265
可以重命名:
In [84]: ( ....: grouped["C"] ....: .agg([np.sum, np.mean, np.std]) ....: .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"}) ....: ) ....: Out[84]: foo bar baz A bar 0.392940 0.130980 0.181231 foo -1.796421 -0.359284 0.912265
NamedAgg
NamedAgg 可以對聚合進行更精準的定義,它包含 column 和aggfunc 兩個定制化的字段。
In [88]: animals = pd.DataFrame( ....: { ....: "kind": ["cat", "dog", "cat", "dog"], ....: "height": [9.1, 6.0, 9.5, 34.0], ....: "weight": [7.9, 7.5, 9.9, 198.0], ....: } ....: ) ....: In [89]: animals Out[89]: kind height weight 0 cat 9.1 7.9 1 dog 6.0 7.5 2 cat 9.5 9.9 3 dog 34.0 198.0 In [90]: animals.groupby("kind").agg( ....: min_height=pd.NamedAgg(column="height", aggfunc="min"), ....: max_height=pd.NamedAgg(column="height", aggfunc="max"), ....: average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean), ....: ) ....: Out[90]: min_height max_height average_weight kind cat 9.1 9.5 8.90 dog 6.0 34.0 102.75
或者直接使用一個元組:
In [91]: animals.groupby("kind").agg( ....: min_height=("height", "min"), ....: max_height=("height", "max"), ....: average_weight=("weight", np.mean), ....: ) ....: Out[91]: min_height max_height average_weight kind cat 9.1 9.5 8.90 dog 6.0 34.0 102.75
不同的列指定不同的聚合方法
通過給agg方法傳入一個字典,可以指定不同的列使用不同的聚合:
In [95]: grouped.agg({"C": "sum", "D": "std"}) Out[95]: C D A bar 0.392940 1.366330 foo -1.796421 0.884785
轉換操作
轉換是將對象轉換為同樣大小對象的操作。在數據分析的過程中,經常需要進行數據的轉換操作。
可以接lambda操作:
In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
填充na值:
In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
過濾操作
filter方法可以通過lambda表達式來過濾我們不需要的數據:
In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3]) In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2) Out[137]: 3 3 4 3 5 3 dtype: int64
Apply操作
有些數據可能不適合進行聚合或者轉換操作,Pandas提供瞭一個 apply
方法,用來進行更加靈活的轉換操作。
In [156]: df Out[156]: A B C D 0 foo one -0.575247 1.346061 1 bar one 0.254161 1.511763 2 foo two -1.143704 1.627081 3 bar three 0.215897 -0.990582 4 foo two 1.193555 -0.441652 5 bar two -0.077118 1.211526 6 foo one -0.408530 0.268520 7 foo three -0.862495 0.024580 In [157]: grouped = df.groupby("A") # could also just call .describe() In [158]: grouped["C"].apply(lambda x: x.describe()) Out[158]: A bar count 3.000000 mean 0.130980 std 0.181231 min -0.077118 25% 0.069390 ... foo min -1.143704 25% -0.862495 50% -0.575247 75% -0.408530 max 1.193555 Name: C, Length: 16, dtype: float64
可以外接函數:
In [159]: grouped = df.groupby('A')['C'] In [160]: def f(group): .....: return pd.DataFrame({'original': group, .....: 'demeaned': group - group.mean()}) .....: In [161]: grouped.apply(f) Out[161]: original demeaned 0 -0.575247 -0.215962 1 0.254161 0.123181 2 -1.143704 -0.784420 3 0.215897 0.084917 4 1.193555 1.552839 5 -0.077118 -0.208098 6 -0.408530 -0.049245 7 -0.862495 -0.503211
本文已收錄於 http://www.flydean.com/11-python-pandas-groupby/
最通俗的解讀,最深刻的幹貨,最簡潔的教程,眾多你不知道的小技巧等你來發現!
到此這篇關於Pandas高級教程之Pandas中的GroupBy操作的文章就介紹到這瞭,更多相關Pandas GroupBy用法內容請搜索WalkonNet以前的文章或繼續瀏覽下面的相關文章希望大傢以後多多支持WalkonNet!
推薦閱讀:
- pandas中groupby操作實現
- Pandas實現聚合運算agg()的示例代碼
- Python groupby函數圖文詳解
- Pandas groupby apply agg 的區別 運行自定義函數說明
- pandas數據分組groupby()和統計函數agg()的使用