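Advanced Python Pandas Tutorial: Working with Time
Introduction
Time is a data type that comes up constantly in data processing. Beyond NumPy's datetime64 and timedelta64 dtypes, pandas also consolidates functionality from other Python libraries such as scikits.timeseries.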
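Time concepts
pandas has four time-related concepts:
- Date times: a specific date and time, optionally with a time zone. Similar to datetime.datetime in the standard library.
- Time deltas: an absolute duration. Similar to datetime.timedelta in the standard library.
- Time spans: a span of time defined by a point in time and its associated frequency.
- Date offsets: a relative duration that respects calendar arithmetic. Similar to dateutil.relativedelta.relativedelta.
The following table summarizes them: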
Concept | Scalar class | Array class | pandas dtype | Primary creation method |
---|---|---|---|---|
Date times | Timestamp | DatetimeIndex | datetime64[ns] or datetime64[ns, tz] | to_datetime or date_range |
Time deltas | Timedelta | TimedeltaIndex | timedelta64[ns] | to_timedelta or timedelta_range |
Time spans | Period | PeriodIndex | period[freq] | Period or period_range |
Date offsets | DateOffset | None | None | DateOffset |
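To make the table concrete, here is a minimal sketch that builds one object of each kind. The variable names and sample dates are illustrative assumptions, not part of the original article.

```python
import pandas as pd

# Scalar classes, one per concept in the table above
stamp = pd.Timestamp("2012-05-01 09:30")   # a point in time
delta = pd.Timedelta(hours=36)             # an absolute duration
span = pd.Period("2012-05", freq="M")      # a whole calendar month
offset = pd.DateOffset(months=1)           # a calendar-aware step

# Array classes produced by the main creation functions
dt_index = pd.to_datetime(["2012-05-01", "2012-05-02"])
td_index = pd.to_timedelta(["1 day", "36 hours"])
pr_index = pd.period_range("2012-01", periods=3, freq="M")

print(type(dt_index).__name__, type(td_index).__name__, type(pr_index).__name__)
print(stamp + delta)    # absolute arithmetic: 2012-05-02 21:30:00
print(stamp + offset)   # calendar arithmetic: 2012-06-01 09:30:00
```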
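Here is a usage example. The IPython sessions in this article assume `import datetime`, `import numpy as np`, and `import pandas as pd`:

    In [19]: pd.Series(range(3), index=pd.date_range("2000", freq="D", periods=3))
    Out[19]:
    2000-01-01    0
    2000-01-02    1
    2000-01-03    2
    Freq: D, dtype: int64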
The date time, time delta, and time span types all share the same null value, NaT:

    In [24]: pd.Timestamp(pd.NaT)
    Out[24]: NaT

    In [25]: pd.Timedelta(pd.NaT)
    Out[25]: NaT

    In [26]: pd.Period(pd.NaT)
    Out[26]: NaT

    # Equality acts as np.nan would
    In [27]: pd.NaT == pd.NaT
    Out[27]: False
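Because NaT never compares equal to itself, missing timestamps have to be detected with isna rather than ==. The following sketch, using made-up sample dates, shows one way to find and fill them.

```python
import pandas as pd

# None in the input becomes NaT in the resulting datetime Series
s = pd.Series(pd.to_datetime(["2012-05-01", None, "2012-05-03"]))

missing = s.isna()                              # element-wise NaT detection
filled = s.fillna(pd.Timestamp("2012-05-02"))   # replace NaT with a chosen value

print(missing.tolist())       # [False, True, False]
print(pd.NaT == pd.NaT)       # False, so == cannot be used to find NaT
print(pd.isna(pd.NaT))        # True
```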
Timestamp
Timestamp is the most basic time type. It can be created from a datetime object, a string, or individual date components:

    In [28]: pd.Timestamp(datetime.datetime(2012, 5, 1))
    Out[28]: Timestamp('2012-05-01 00:00:00')

    In [29]: pd.Timestamp("2012-05-01")
    Out[29]: Timestamp('2012-05-01 00:00:00')

    In [30]: pd.Timestamp(2012, 5, 1)
    Out[30]: Timestamp('2012-05-01 00:00:00')
DatetimeIndex
A list of Timestamp objects used as an index is automatically converted to a DatetimeIndex:

    In [33]: dates = [
       ....:     pd.Timestamp("2012-05-01"),
       ....:     pd.Timestamp("2012-05-02"),
       ....:     pd.Timestamp("2012-05-03"),
       ....: ]
       ....:

    In [34]: ts = pd.Series(np.random.randn(3), dates)

    In [35]: type(ts.index)
    Out[35]: pandas.core.indexes.datetimes.DatetimeIndex

    In [36]: ts.index
    Out[36]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)

    In [37]: ts
    Out[37]:
    2012-05-01    0.469112
    2012-05-02   -0.282863
    2012-05-03   -1.509059
    dtype: float64
date_range and bdate_range
A DatetimeIndex can also be created with date_range:

    In [74]: start = datetime.datetime(2011, 1, 1)

    In [75]: end = datetime.datetime(2012, 1, 1)

    In [76]: index = pd.date_range(start, end)

    In [77]: index
    Out[77]:
    DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
                   '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
                   '2011-01-09', '2011-01-10',
                   ...
                   '2011-12-23', '2011-12-24', '2011-12-25', '2011-12-26',
                   '2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30',
                   '2011-12-31', '2012-01-01'],
                  dtype='datetime64[ns]', length=366, freq='D')
date_range generates calendar days by default, while bdate_range generates business days:

    In [78]: index = pd.bdate_range(start, end)

    In [79]: index
    Out[79]:
    DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
                   '2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
                   '2011-01-13', '2011-01-14',
                   ...
                   '2011-12-19', '2011-12-20', '2011-12-21', '2011-12-22',
                   '2011-12-23', '2011-12-26', '2011-12-27', '2011-12-28',
                   '2011-12-29', '2011-12-30'],
                  dtype='datetime64[ns]', length=260, freq='B')
Both functions accept combinations of the start, end, periods, and freq parameters:

    In [84]: pd.bdate_range(end=end, periods=20)

    In [83]: pd.date_range(start, end, freq="W")

    In [86]: pd.date_range("2018-01-01", "2018-01-05", periods=5)
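The effect of each parameter combination is easier to see on a short range. This sketch uses January 2011 as an arbitrary sample window; the variable names are illustrative.

```python
import pandas as pd

start, end = "2011-01-01", "2011-01-31"

daily = pd.date_range(start, end)                # every calendar day
business = pd.bdate_range(start, end)            # Monday to Friday only
weekly = pd.date_range(start, end, freq="W")     # one entry per week (Sundays)
last_five = pd.bdate_range(end=end, periods=5)   # five business days ending at `end`
even = pd.date_range(start, end, periods=4)      # four evenly spaced points

print(len(daily), len(business), len(weekly))    # 31 21 5
print(last_five[0].date())                       # 2011-01-25
print(even.day.tolist())                         # [1, 11, 21, 31]
```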
origin
The origin parameter of to_datetime changes the reference point from which numeric input is counted:

    In [67]: pd.to_datetime([1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01"))
    Out[67]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)
By default origin='unix', that is, the reference point is 1970-01-01 00:00:00:

    In [68]: pd.to_datetime([1, 2, 3], unit="D")
    Out[68]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)
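A common use of unit and origin is converting epoch-style integers. The sketch below assumes the input values are seconds since the Unix epoch in the first case and day counts from 1960-01-01 in the second; both inputs are invented for illustration.

```python
import pandas as pd

# 86400 seconds after the default Unix origin is one day later
one_day = pd.to_datetime(86400, unit="s")
print(one_day)    # 1970-01-02 00:00:00

# The same kind of integer input, counted in days from a custom origin
custom = pd.to_datetime([0, 1, 366], unit="D", origin=pd.Timestamp("1960-01-01"))
print(custom)     # 1960-01-01, 1960-01-02, 1961-01-01 (1960 is a leap year)
```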
Formatting
The format parameter tells to_datetime how to parse the input string:

    In [51]: pd.to_datetime("2010/11/12", format="%Y/%m/%d")
    Out[51]: Timestamp('2010-11-12 00:00:00')

    In [52]: pd.to_datetime("12-11-2010 00:00", format="%d-%m-%Y %H:%M")
    Out[52]: Timestamp('2010-11-12 00:00:00')
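When some inputs may not match the format, errors="coerce" turns them into NaT instead of raising. This is a small sketch with made-up input strings; strftime is shown as the reverse operation.

```python
import pandas as pd

raw = ["12-11-2010", "31-12-2010", "not a date"]

# Day-first strings parsed with an explicit format; bad values become NaT
parsed = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")
print(parsed)

# Format the valid entries back into strings
back = parsed[:2].strftime("%Y/%m/%d").tolist()
print(back)    # ['2010/11/12', '2010/12/31']
```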
Period
A Period represents a time span and is usually used together with a freq. If freq is omitted, it is inferred from the string:

    In [31]: pd.Period("2011-01")
    Out[31]: Period('2011-01', 'M')

    In [32]: pd.Period("2012-05", freq="D")
    Out[32]: Period('2012-05-01', 'D')
Integer arithmetic works directly on a Period and moves it in units of its freq:

    In [345]: p = pd.Period("2012", freq="A-DEC")

    In [346]: p + 1
    Out[346]: Period('2013', 'A-DEC')

    In [347]: p - 3
    Out[347]: Period('2009', 'A-DEC')

    In [348]: p = pd.Period("2012-01", freq="2M")

    In [349]: p + 2
    Out[349]: Period('2012-05', '2M')

    In [350]: p - 1
    Out[350]: Period('2011-11', '2M')
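Note that a Period can only take part in arithmetic with values compatible with its own freq. This includes offsets and timedelta-like objects, which can be added when they amount to a whole multiple of the period's frequency:

    In [352]: p = pd.Period("2014-07-01 09:00", freq="H")

    In [353]: p + pd.offsets.Hour(2)
    Out[353]: Period('2014-07-01 11:00', 'H')

    In [354]: p + datetime.timedelta(minutes=120)
    Out[354]: Period('2014-07-01 11:00', 'H')

    In [355]: p + np.timedelta64(7200, "s")
    Out[355]: Period('2014-07-01 11:00', 'H')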
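The compatibility rule can be checked directly. This sketch uses a minute-frequency Period (a choice made here so the alias works across pandas versions) and shows that a 30-second timedelta, which is not a whole number of minutes, is rejected.

```python
import datetime
import pandas as pd

p = pd.Period("2014-07-01 09:00", freq="min")

# Both of these are whole multiples of one minute, so they are accepted
plus_offset = p + pd.offsets.Minute(120)
plus_delta = p + datetime.timedelta(hours=2)
print(plus_offset, plus_delta)

# 30 seconds is not a whole number of minutes, so pandas refuses
try:
    p + datetime.timedelta(seconds=30)
    compatible = True
except ValueError as exc:  # IncompatibleFrequency is a ValueError subclass
    compatible = False
    print("rejected:", exc)
```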
A list of Period objects used as an index is automatically converted to a PeriodIndex:

    In [38]: periods = [pd.Period("2012-01"), pd.Period("2012-02"), pd.Period("2012-03")]

    In [39]: ts = pd.Series(np.random.randn(3), periods)

    In [40]: type(ts.index)
    Out[40]: pandas.core.indexes.period.PeriodIndex

    In [41]: ts.index
    Out[41]: PeriodIndex(['2012-01', '2012-02', '2012-03'], dtype='period[M]', freq='M')

    In [42]: ts
    Out[42]:
    2012-01   -1.135632
    2012-02    1.212112
    2012-03   -0.173215
    Freq: M, dtype: float64
A PeriodIndex can be created with pd.period_range:

    In [359]: prng = pd.period_range("1/1/2011", "1/1/2012", freq="M")

    In [360]: prng
    Out[360]:
    PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',
                 '2011-07', '2011-08', '2011-09', '2011-10', '2011-11', '2011-12',
                 '2012-01'],
                dtype='period[M]', freq='M')
It can also be constructed directly with the PeriodIndex class:

    In [361]: pd.PeriodIndex(["2011-1", "2011-2", "2011-3"], freq="M")
    Out[361]: PeriodIndex(['2011-01', '2011-02', '2011-03'], dtype='period[M]', freq='M')
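Since each Period covers a span rather than a single instant, it exposes both ends of that span. The following sketch, with an arbitrary three-month range, shows start_time, end_time, and conversion to timestamps.

```python
import pandas as pd

prng = pd.period_range("2011-01", "2011-03", freq="M")

# Convert every period to the timestamp at which it begins
starts = prng.to_timestamp(how="start")
print(starts)

# A single Period knows where its span begins and ends
first = prng[0]
print(first.start_time)   # 2011-01-01 00:00:00
print(first.end_time)     # last instant of 2011-01-31
```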
DateOffset
A DateOffset is a frequency object. Like a Timedelta it represents a duration, but it follows calendar rules. A Timedelta of one day is always exactly 24 hours, whereas a DateOffset of one day can span 23, 24, or 25 hours depending on daylight saving time transitions.

    # This particular day contains a day light savings time transition
    In [144]: ts = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")

    # Respects absolute time
    In [145]: ts + pd.Timedelta(days=1)
    Out[145]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')

    # Respects calendar time
    In [146]: ts + pd.DateOffset(days=1)
    Out[146]: Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')

    In [147]: friday = pd.Timestamp("2018-01-05")

    In [148]: friday.day_name()
    Out[148]: 'Friday'

    # Add 2 business days (Friday --> Tuesday)
    In [149]: two_business_days = 2 * pd.offsets.BDay()

    In [150]: two_business_days.apply(friday)
    Out[150]: Timestamp('2018-01-09 00:00:00')

    In [151]: friday + two_business_days
    Out[151]: Timestamp('2018-01-09 00:00:00')

    In [152]: (friday + two_business_days).day_name()
    Out[152]: 'Tuesday'
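Date offsets are closely tied to frequency strings. The table below lists the available date offsets and their associated frequency strings. The aliases are those of pandas 1.x; pandas 2.2 renamed several of them, for example 'M' to 'ME' and 'H' to 'h'.
Date Offset | Frequency String | Description |
---|---|---|
DateOffset | None | Generic offset class |
BDay or BusinessDay | 'B' | Business day (weekday) |
CDay or CustomBusinessDay | 'C' | Custom business day |
Week | 'W' | One week |
WeekOfMonth | 'WOM' | The x-th day of the y-th week of each month |
LastWeekOfMonth | 'LWOM' | The x-th day of the last week of each month |
MonthEnd | 'M' | Calendar month end |
MonthBegin | 'MS' | Calendar month begin |
BMonthEnd or BusinessMonthEnd | 'BM' | Business month end |
BMonthBegin or BusinessMonthBegin | 'BMS' | Business month begin |
CBMonthEnd or CustomBusinessMonthEnd | 'CBM' | Custom business month end |
CBMonthBegin or CustomBusinessMonthBegin | 'CBMS' | Custom business month begin |
SemiMonthEnd | 'SM' | 15th (or other day of month) and calendar month end |
SemiMonthBegin | 'SMS' | 15th (or other day of month) and calendar month begin |
QuarterEnd | 'Q' | Calendar quarter end |
QuarterBegin | 'QS' | Calendar quarter begin |
BQuarterEnd | 'BQ' | Business quarter end |
BQuarterBegin | 'BQS' | Business quarter begin |
FY5253Quarter | 'REQ' | Retail (52-53 week) quarter |
YearEnd | 'A' | Calendar year end |
YearBegin | 'AS' or 'BYS' | Calendar year begin |
BYearEnd | 'BA' | Business year end |
BYearBegin | 'BAS' | Business year begin |
FY5253 | 'RE' | Retail (52-53 week) year |
Easter | None | Easter holiday |
BusinessHour | 'BH' | Business hour |
CustomBusinessHour | 'CBH' | Custom business hour |
Day | 'D' | One absolute day |
Hour | 'H' | One hour |
Minute | 'T' or 'min' | One minute |
Second | 'S' | One second |
Milli | 'L' or 'ms' | One millisecond |
Micro | 'U' or 'us' | One microsecond |
Nano | 'N' | One nanosecond |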
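A few of the offsets in the table are sketched below using the offset classes themselves rather than alias strings, which sidesteps alias differences between pandas versions. The dates are arbitrary examples.

```python
import pandas as pd

mid_jan = pd.Timestamp("2018-01-15")
end_jan = pd.Timestamp("2018-01-31")
friday = pd.Timestamp("2018-01-05")

print(mid_jan + pd.offsets.MonthEnd())       # 2018-01-31: next month end
print(end_jan + pd.offsets.MonthEnd())       # 2018-02-28: already on a month end, so moves on
print(mid_jan + pd.offsets.MonthBegin())     # 2018-02-01
print(end_jan + pd.DateOffset(months=1))     # 2018-02-28: day clipped to the month length
print(friday + pd.offsets.BDay(2))           # 2018-01-09: skips the weekend

# An offset object can also serve as the freq of a range
month_starts = pd.date_range("2018-01-01", periods=3, freq=pd.offsets.MonthBegin())
print(month_starts)
```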
A DateOffset also has two methods, rollforward() and rollback(), which move a timestamp forward or backward to the nearest valid offset date:

    In [153]: ts = pd.Timestamp("2018-01-06 00:00:00")

    In [154]: ts.day_name()
    Out[154]: 'Saturday'

    # BusinessHour's valid offset dates are Monday through Friday
    In [155]: offset = pd.offsets.BusinessHour(start="09:00")

    # Bring the date to the closest offset date (Monday)
    In [156]: offset.rollforward(ts)
    Out[156]: Timestamp('2018-01-08 09:00:00')

    # Date is brought to the closest offset date first and then the hour is added
    In [157]: ts + offset
    Out[157]: Timestamp('2018-01-08 10:00:00')
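The same two methods work for any offset. Here is a short sketch with BDay, chosen as a simpler case than BusinessHour; a date that is already valid for the offset is returned unchanged.

```python
import pandas as pd

bday = pd.offsets.BDay()
saturday = pd.Timestamp("2018-01-06")
monday = pd.Timestamp("2018-01-08")

fwd = bday.rollforward(saturday)      # next business day
back = bday.rollback(saturday)        # previous business day
unchanged = bday.rollforward(monday)  # already a business day

print(fwd.day_name(), back.day_name())   # Monday Friday
print(unchanged == monday)               # True
```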
These operations preserve the hour, minute, and other time components. To reset the time to 00:00:00, call normalize(). The session below uses offset.apply(ts), which is no longer available in pandas 2.x; ts + offset gives the same result.

    In [158]: ts = pd.Timestamp("2014-01-01 09:00")

    In [159]: day = pd.offsets.Day()

    In [160]: day.apply(ts)
    Out[160]: Timestamp('2014-01-02 09:00:00')

    In [161]: day.apply(ts).normalize()
    Out[161]: Timestamp('2014-01-02 00:00:00')

    In [162]: ts = pd.Timestamp("2014-01-01 22:00")

    In [163]: hour = pd.offsets.Hour()

    In [164]: hour.apply(ts)
    Out[164]: Timestamp('2014-01-01 23:00:00')

    In [165]: hour.apply(ts).normalize()
    Out[165]: Timestamp('2014-01-01 00:00:00')

    In [166]: hour.apply(pd.Timestamp("2014-01-01 23:30")).normalize()
    Out[166]: Timestamp('2014-01-02 00:00:00')
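The normalize step can be reproduced with plain addition, which works across pandas versions. This is a sketch of the hour-based cases using the same sample timestamps.

```python
import pandas as pd

hour = pd.offsets.Hour()

# Adding the offset keeps the time of day
shifted = pd.Timestamp("2014-01-01 22:00") + hour
print(shifted)               # 2014-01-01 23:00:00
print(shifted.normalize())   # 2014-01-01 00:00:00

# Crossing midnight first, then normalizing, lands on the next day
late = pd.Timestamp("2014-01-01 23:30") + hour
print(late.normalize())      # 2014-01-02 00:00:00
```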
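Using time as an index
Time values can serve as an index, and a time-based index comes with several convenient features.
Data can be selected directly with date strings or datetime objects. In the session below, ts is a Series indexed by the business month ends of 2011:

    In [99]: ts["1/31/2011"]
    Out[99]: 0.11920871129693428

    In [100]: ts[datetime.datetime(2011, 12, 25):]
    Out[100]:
    2011-12-30    0.56702
    Freq: BM, dtype: float64

    In [101]: ts["10/31/2011":"12/31/2011"]
    Out[101]:
    2011-10-31    0.271860
    2011-11-30   -0.424972
    2011-12-30    0.567020
    Freq: BM, dtype: float64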
Select a whole year:

    In [102]: ts["2011"]
    Out[102]:
    2011-01-31    0.119209
    2011-02-28   -1.044236
    2011-03-31   -0.861849
    2011-04-29   -2.104569
    2011-05-31   -0.494929
    2011-06-30    1.071804
    2011-07-29    0.721555
    2011-08-31   -0.706771
    2011-09-30   -1.039575
    2011-10-31    0.271860
    2011-11-30   -0.424972
    2011-12-30    0.567020
    Freq: BM, dtype: float64
Select a single month:

    In [103]: ts["2011-6"]
    Out[103]:
    2011-06-30    1.071804
    Freq: BM, dtype: float64
A DataFrame accepts date strings as the argument to loc. Here dft has 100,000 rows at one-minute intervals, all of which fall in 2013:

    In [105]: dft
    Out[105]:
                                A
    2013-01-01 00:00:00  0.276232
    2013-01-01 00:01:00 -1.087401
    2013-01-01 00:02:00 -0.673690
    2013-01-01 00:03:00  0.113648
    2013-01-01 00:04:00 -1.478427
    ...                       ...
    2013-03-11 10:35:00 -0.747967
    2013-03-11 10:36:00 -0.034523
    2013-03-11 10:37:00 -0.201754
    2013-03-11 10:38:00 -1.509067
    2013-03-11 10:39:00 -1.693043

    [100000 rows x 1 columns]

    In [106]: dft.loc["2013"]
    Out[106]:
                                A
    2013-01-01 00:00:00  0.276232
    2013-01-01 00:01:00 -1.087401
    2013-01-01 00:02:00 -0.673690
    2013-01-01 00:03:00  0.113648
    2013-01-01 00:04:00 -1.478427
    ...                       ...
    2013-03-11 10:35:00 -0.747967
    2013-03-11 10:36:00 -0.034523
    2013-03-11 10:37:00 -0.201754
    2013-03-11 10:38:00 -1.509067
    2013-03-11 10:39:00 -1.693043

    [100000 rows x 1 columns]
Slicing by time includes both endpoints, so the slice below runs through the last minute of February:

    In [107]: dft["2013-1":"2013-2"]
    Out[107]:
                                A
    2013-01-01 00:00:00  0.276232
    2013-01-01 00:01:00 -1.087401
    2013-01-01 00:02:00 -0.673690
    2013-01-01 00:03:00  0.113648
    2013-01-01 00:04:00 -1.478427
    ...                       ...
    2013-02-28 23:55:00  0.850929
    2013-02-28 23:56:00  0.976712
    2013-02-28 23:57:00 -2.693884
    2013-02-28 23:58:00 -1.575535
    2013-02-28 23:59:00 -1.573517

    [84960 rows x 1 columns]
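The selections above depend on variables defined outside the sessions, so here is a self-contained sketch of partial string indexing. It assumes a daily Series covering 2011 with integer values, and uses .loc throughout, which is the form that works consistently across pandas versions.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2011-01-01", "2011-12-31", freq="D")
ts = pd.Series(np.arange(len(idx)), index=idx)

whole_year = ts.loc["2011"]                        # every row in 2011
june = ts.loc["2011-06"]                           # every row in June
autumn = ts.loc["2011-10-31":"2011-12-31"]         # both endpoints included
from_xmas = ts.loc[pd.Timestamp("2011-12-25"):]    # open-ended slice

print(len(whole_year), len(june), len(autumn), len(from_xmas))   # 365 30 62 7
```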
Slicing versus exact matching
Consider the following Series, whose index has minute resolution:

    In [120]: series_minute = pd.Series(
       .....:     [1, 2, 3],
       .....:     pd.DatetimeIndex(
       .....:         ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]
       .....:     ),
       .....: )
       .....:

    In [121]: series_minute.index.resolution
    Out[121]: 'minute'
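A timestamp string that is less precise than the index resolution (here, coarser than a minute) is treated as a slice and returns a Series:

    In [122]: series_minute["2011-12-31 23"]
    Out[122]:
    2011-12-31 23:59:00    1
    dtype: int64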
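A timestamp string that is at least as precise as the index resolution is treated as an exact match and returns a scalar:

    In [123]: series_minute["2011-12-31 23:59"]
    Out[123]: 1

    In [124]: series_minute["2011-12-31 23:59:00"]
    Out[124]: 1

Likewise, if the index has second resolution, a string coarser than a second returns a Series, while a string with second precision returns a scalar.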
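The difference in return type is easy to verify programmatically. This sketch rebuilds the minute-resolution Series and checks what each lookup returns, using .loc for both lookups.

```python
import pandas as pd

series_minute = pd.Series(
    [1, 2, 3],
    index=pd.DatetimeIndex(
        ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]
    ),
)

# Hour precision is coarser than the index resolution: treated as a slice
by_hour = series_minute.loc["2011-12-31 23"]

# Minute precision matches the index resolution: treated as an exact match
by_minute = series_minute.loc["2011-12-31 23:59"]

print(type(by_hour).__name__, len(by_hour))   # Series 1
print(by_minute)                              # 1
```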
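Time series operations
Shifting
The shift method moves the values of a time series along its index. Judging by the output, rng in the session below is a three-day daily range starting on 2012-01-01, so slicing to the first five rows leaves all three:

    In [275]: ts = pd.Series(range(len(rng)), index=rng)

    In [276]: ts = ts[:5]

    In [277]: ts.shift(1)
    Out[277]:
    2012-01-01    NaN
    2012-01-02    0.0
    2012-01-03    1.0
    Freq: D, dtype: float64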
When freq is specified, shift moves the index by that frequency and leaves the values in place:

    In [278]: ts.shift(5, freq="D")
    Out[278]:
    2012-01-06    0
    2012-01-07    1
    2012-01-08    2
    Freq: D, dtype: int64

    In [279]: ts.shift(5, freq=pd.offsets.BDay())
    Out[279]:
    2012-01-06    0
    2012-01-09    1
    2012-01-10    2
    dtype: int64

    In [280]: ts.shift(5, freq="BM")
    Out[280]:
    2012-05-31    0
    2012-05-31    1
    2012-05-31    2
    dtype: int64
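The two modes of shift can be compared side by side. This sketch builds its own three-day Series (an assumption matching the output above) and checks what changes in each case.

```python
import pandas as pd

ts = pd.Series([0, 1, 2], index=pd.date_range("2012-01-01", periods=3, freq="D"))

moved_values = ts.shift(1)                             # values slide, index stays
moved_index = ts.shift(5, freq="D")                    # index slides, values stay
moved_bdays = ts.shift(5, freq=pd.offsets.BDay())      # index slides by business days

print(moved_values.tolist())          # [nan, 0.0, 1.0]
print(moved_index.index[0].date())    # 2012-01-06
print(moved_bdays.index[1].date())    # 2012-01-09
```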
Frequency conversion
A time series can be converted to another frequency by calling asfreq. Dates that did not exist in the original series are filled with NaN:

    In [281]: dr = pd.date_range("1/1/2010", periods=3, freq=3 * pd.offsets.BDay())

    In [282]: ts = pd.Series(np.random.randn(3), index=dr)

    In [283]: ts
    Out[283]:
    2010-01-01    1.494522
    2010-01-06   -0.778425
    2010-01-11   -0.253355
    Freq: 3B, dtype: float64

    In [284]: ts.asfreq(pd.offsets.BDay())
    Out[284]:
    2010-01-01    1.494522
    2010-01-04         NaN
    2010-01-05         NaN
    2010-01-06   -0.778425
    2010-01-07         NaN
    2010-01-08         NaN
    2010-01-11   -0.253355
    Freq: B, dtype: float64
asfreq can also specify how to fill the gaps introduced by the new frequency:

    In [285]: ts.asfreq(pd.offsets.BDay(), method="pad")
    Out[285]:
    2010-01-01    1.494522
    2010-01-04    1.494522
    2010-01-05    1.494522
    2010-01-06   -0.778425
    2010-01-07   -0.778425
    2010-01-08   -0.778425
    2010-01-11   -0.253355
    Freq: B, dtype: float64
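With fixed values instead of random ones, the fill behaviour is easier to follow. This sketch uses method="ffill" (the same forward fill that "pad" names) and also shows the fill_value alternative; the values 1.0, 2.0, 3.0 are made up.

```python
import pandas as pd

dr = pd.date_range("2010-01-01", periods=3, freq=3 * pd.offsets.BDay())
ts = pd.Series([1.0, 2.0, 3.0], index=dr)

sparse = ts.asfreq(pd.offsets.BDay())                   # gaps become NaN
filled = ts.asfreq(pd.offsets.BDay(), method="ffill")   # carry last value forward
zeros = ts.asfreq(pd.offsets.BDay(), fill_value=0.0)    # constant for new rows

print(int(sparse.isna().sum()))   # 4
print(filled.tolist())            # [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0]
print(zeros.tolist())             # [1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 3.0]
```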
Resampling
A time series can be resampled by calling resample and then an aggregation. Here 100 one-second samples are combined into a single 5-minute bin:

    In [286]: rng = pd.date_range("1/1/2012", periods=100, freq="S")

    In [287]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

    In [288]: ts.resample("5Min").sum()
    Out[288]:
    2012-01-01    25103
    Freq: 5T, dtype: int64
The object returned by resample supports many aggregation methods, such as sum, mean, std, sem, max, min, median, first, last, and ohlc:

    In [289]: ts.resample("5Min").mean()
    Out[289]:
    2012-01-01    251.03
    Freq: 5T, dtype: float64

    In [290]: ts.resample("5Min").ohlc()
    Out[290]:
                open  high  low  close
    2012-01-01   308   460    9    205

    In [291]: ts.resample("5Min").max()
    Out[291]:
    2012-01-01    460
    Freq: 5T, dtype: int64
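A deterministic example makes the binning visible. This sketch assumes ten one-minute samples with values 0 through 9, which fall into two 5-minute bins; the "min" alias is used because it is accepted by both older and newer pandas versions.

```python
import pandas as pd

rng = pd.date_range("2012-01-01", periods=10, freq="min")
ts = pd.Series(range(10), index=rng)

bins = ts.resample("5min")   # group samples into 5-minute bins

totals = bins.sum()          # 0+1+2+3+4 and 5+6+7+8+9
means = bins.mean()
bars = bins.ohlc()           # open, high, low, close per bin

print(totals.tolist())       # [10, 35]
print(means.tolist())        # [2.0, 7.0]
print(bars)
```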