I want to plot some datetimes and would like to specify a time interval in order to bundle them together and make a histogram. So for example, if there happen to be n datetimes in the span of one hour, group them together or parse them as year, month, day, hour. And omit minutes and seconds. Let's say I have a data frame with some datetime values:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'test': days, 'col2': data})
df = df.set_index('test')
print(df)
2018-06-19 17:10:32.076646 29
2018-06-20 17:10:32.076646 56
2018-06-21 17:10:32.076646 82
2018-06-22 17:10:32.076646 13
2018-06-23 17:10:32.076646 35
2018-06-24 17:10:32.076646 53
2018-06-25 17:10:32.076646 25
2018-06-26 17:10:32.076646 23
Ideally, I would like to specify a more flexible time interval, such as "6 hours" in order to make some sort of modulo operation on the datetimes. Is this possible?
pd.Grouper
Allows you to specify regular frequency intervals with which you will group your data. Use groupby
to then aggregate your df
based on these groups. For instance, if col2
was counts and you wanted to bin together all of the counts over 2 day intervals, you could do:
import pandas as pd
df.groupby(pd.Grouper(level=0, freq='2D')).col2.sum()
Outputs:
test
2018-06-19 13:49:11.560185 85
2018-06-21 13:49:11.560185 95
2018-06-23 13:49:11.560185 88
2018-06-25 13:49:11.560185 48
Name: col2, dtype: int32
You group by level=0
, that is your index labeled 'test'
and sum col2
over 2 day bins. The behavior of pd.Grouper
can be a little annoying since in this example the bins start and end at 13:49:11..., which likely isn't what you want.
pd.cut
+ pd.date_range
You have a bit more control over defining your bins if you define them with pd.date_range
and then use pd.cut
. Here for instance, you can define bins every 2 days beginning on the 19th.
df.groupby(pd.cut(df.index,
pd.date_range('2018-06-19', '2018-06-27', freq='2D'))).col2.sum()
Outputs:
(2018-06-19, 2018-06-21] 85
(2018-06-21, 2018-06-23] 95
(2018-06-23, 2018-06-25] 88
(2018-06-25, 2018-06-27] 48
Name: col2, dtype: int32
This is nice, because if you instead wanted the bins to begin on even days you can just change the start and end dates in pd.date_range
df.groupby(pd.cut(df.index,
pd.date_range('2018-06-18', '2018-06-28', freq='2D'))).col2.sum()
Outputs:
(2018-06-18, 2018-06-20] 29
(2018-06-20, 2018-06-22] 138
(2018-06-22, 2018-06-24] 48
(2018-06-24, 2018-06-26] 78
(2018-06-26, 2018-06-28] 23
Name: col2, dtype: int32
If you really wanted to, you could specify 2.6 hour bins beginning on June 19th 2018 at 5 AM:
df.groupby(pd.cut(df.index,
pd.date_range('2018-06-19 5:00:00', '2018-06-28 5:00:00', freq='2.6H'))).col2.sum()
#(2018-06-19 05:00:00, 2018-06-19 07:36:00] 0
#(2018-06-19 07:36:00, 2018-06-19 10:12:00] 0
#(2018-06-19 10:12:00, 2018-06-19 12:48:00] 0
#(2018-06-19 12:48:00, 2018-06-19 15:24:00] 29
#....
Just use .plot(kind='bar')
after you have aggregated the data.
(df.groupby(pd.cut(df.index,
pd.date_range('2018-06-19', '2018-06-28', freq='2D')))
.col2.sum().plot(kind='bar', color='firebrick', rot=30))