Tags: python, pandas, python-datetime

How to make a histogram of pandas datetimes per specific time interval?


I want to plot some datetimes and would like to specify a time interval in order to bundle them together and make a histogram. For example, if there happen to be n datetimes within the span of one hour, group them together, i.e. keep only year, month, day, and hour and drop minutes and seconds. Let's say I have a DataFrame with some datetime values:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')

np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'test': days, 'col2': data})
df = df.set_index('test')
print(df)

2018-06-19 17:10:32.076646    29
2018-06-20 17:10:32.076646    56
2018-06-21 17:10:32.076646    82
2018-06-22 17:10:32.076646    13
2018-06-23 17:10:32.076646    35
2018-06-24 17:10:32.076646    53
2018-06-25 17:10:32.076646    25
2018-06-26 17:10:32.076646    23

Ideally, I would like to specify a more flexible time interval, such as "6 hours" in order to make some sort of modulo operation on the datetimes. Is this possible?


Solution

  • pd.Grouper

    Allows you to specify regular frequency intervals with which to group your data; use groupby to then aggregate your df based on these groups. For instance, if col2 were counts and you wanted to bin together all of the counts over 2-day intervals, you could do:

    import pandas as pd
    df.groupby(pd.Grouper(level=0, freq='2D')).col2.sum()
    

    Outputs:

    test
    2018-06-19 13:49:11.560185    85
    2018-06-21 13:49:11.560185    95
    2018-06-23 13:49:11.560185    88
    2018-06-25 13:49:11.560185    48
    Name: col2, dtype: int32
    

    You group by level=0, that is your index labeled 'test', and sum col2 over 2-day bins. The behavior of pd.Grouper can be a little annoying, since in this example the bins start and end at 13:49:11..., which likely isn't what you want.
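
    If you are on pandas 1.1 or newer, pd.Grouper also takes an origin (and offset) argument that lets you control where the bin edges fall; a minimal sketch anchoring them at midnight of the 19th:

    # pandas >= 1.1: anchor the 2-day bin edges at midnight of June 19th
    df.groupby(pd.Grouper(level=0, freq='2D', origin=pd.Timestamp('2018-06-19'))).col2.sum()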

  • pd.cut + pd.date_range

    You have a bit more control over your bins if you build them with pd.date_range and then use pd.cut. Here, for instance, you can define bins every 2 days beginning on the 19th:

    df.groupby(pd.cut(df.index, 
                      pd.date_range('2018-06-19', '2018-06-27', freq='2D'))).col2.sum()
    

    Outputs:

    (2018-06-19, 2018-06-21]    85
    (2018-06-21, 2018-06-23]    95
    (2018-06-23, 2018-06-25]    88
    (2018-06-25, 2018-06-27]    48
    Name: col2, dtype: int32
    

    This is nice because, if you instead want the bins to begin on even days, you can just change the start and end dates in pd.date_range:

    df.groupby(pd.cut(df.index, 
                      pd.date_range('2018-06-18', '2018-06-28', freq='2D'))).col2.sum()
    

    Outputs:

    (2018-06-18, 2018-06-20]     29
    (2018-06-20, 2018-06-22]    138
    (2018-06-22, 2018-06-24]     48
    (2018-06-24, 2018-06-26]     78
    (2018-06-26, 2018-06-28]     23
    Name: col2, dtype: int32
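
    If you would rather see readable labels than Interval objects on the groups (handy when you plot later), pd.cut also accepts a labels argument; a small sketch, where the label format is just one choice:

    # Build the 2-day edges once and label each bin "Jun 18-Jun 20", etc.
    bins = pd.date_range('2018-06-18', '2018-06-28', freq='2D')
    labels = [f'{b:%b %d}-{e:%b %d}' for b, e in zip(bins[:-1], bins[1:])]
    df.groupby(pd.cut(df.index, bins, labels=labels)).col2.sum()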
    

    If you really wanted to, you could specify 2.6-hour bins beginning on June 19th, 2018 at 5 AM:

    df.groupby(pd.cut(df.index, 
                      pd.date_range('2018-06-19 5:00:00', '2018-06-28 5:00:00', freq='2.6H'))).col2.sum()
    #(2018-06-19 05:00:00, 2018-06-19 07:36:00]     0
    #(2018-06-19 07:36:00, 2018-06-19 10:12:00]     0
    #(2018-06-19 10:12:00, 2018-06-19 12:48:00]     0
    #(2018-06-19 12:48:00, 2018-06-19 15:24:00]    29
    #....
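
    And to come back to the "6 hours" interval from the question, the same pattern applies directly; a short sketch (most 6-hour buckets are zero here, since the sample data only has one row per day):

    # The '6 hours' case from the question: same idea, freq='6H'
    df.groupby(pd.Grouper(level=0, freq='6H')).col2.sum()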
    

  • Histogram

    Just use .plot(kind='bar') after you have aggregated the data.

    (df.groupby(pd.cut(df.index, 
                       pd.date_range('2018-06-19', '2018-06-28', freq='2D')))
       .col2.sum().plot(kind='bar', color='firebrick', rot=30))
    

    [Bar chart: col2 summed over 2-day bins]
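
    Since the index is already a DatetimeIndex, resample offers an equivalent, slightly shorter route; a sketch reusing the styling above (the x-tick labels will be bin start timestamps rather than intervals):

    # Same 2-day aggregation via resample, then the same bar plot
    (df.col2.resample('2D').sum()
       .plot(kind='bar', color='firebrick', rot=30))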