Binning in pandas starting from specific date


I am trying to bin values by date. The DataFrame looks like this:

        type     event_date
43851   MEDIUM 2017-10-09 13:28:33
43852     HIGH 2017-10-09 14:19:49
43853     HIGH 2017-10-09 14:23:25
43854     HIGH 2017-10-09 14:24:18
43855   MEDIUM 2017-10-09 14:25:31
43856      LOW 2017-10-09 14:25:33
43857   MEDIUM 2017-10-09 14:25:33
43858      LOW 2017-10-09 14:25:38

I would like to bin starting from a specific date and count the occurrences of each type in every half-hour interval. I tried:

grouper = df.groupby([pd.Grouper(freq='30T',key='event_date'), 'type'])
grouper['other_col'].count()

which does almost exactly what I want:

event_date           type
2017-10-09 13:00:00  MEDIUM      1
2017-10-09 14:00:00  HIGH        3
                     LOW         2
                     MEDIUM      2

In addition, I would like to:

  1. Start from a specified time (in my case, the first occurrence of LOW minus 12 hours, i.e. 02:25:33) rather than from the first available hour rounded down.
  2. Also display the empty intervals.
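
As a quick check of the arithmetic in item 1, the starting point can be computed with pandas timestamps. This is only a sketch: the names `first_low` and `start` are illustrative, and the timestamp is copied by hand from the first LOW row of the sample above.

```python
import pandas as pd

# First LOW row in the sample data (copied by hand from the table above).
first_low = pd.Timestamp("2017-10-09 14:25:33")

# Twelve hours earlier is the desired start of the first bin.
start = first_low - pd.Timedelta(hours=12)
print(start)
```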

Solution

  • You can use pd.cut

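    import datetime as dt
    import pandas as pd

    # Start 12 hours before the first LOW event.
    starting_hour = (df[df.type=='LOW'].head(1).event_date - dt.timedelta(hours=12)).item()
    # 49 edges give 48 half-hour bins; '30min' is the current spelling of the '30T' alias.
    intervals = pd.cut(df.event_date, pd.date_range(start=starting_hour, freq='30min', periods=49))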
    
    43851    (2017-10-09 13:25:33, 2017-10-09 13:55:33]
    43852    (2017-10-09 13:55:33, 2017-10-09 14:25:33]
    43853    (2017-10-09 13:55:33, 2017-10-09 14:25:33]
    43854    (2017-10-09 13:55:33, 2017-10-09 14:25:33]
    43855    (2017-10-09 13:55:33, 2017-10-09 14:25:33]
    43856    (2017-10-09 13:55:33, 2017-10-09 14:25:33]
    43857    (2017-10-09 13:55:33, 2017-10-09 14:25:33]
    43858    (2017-10-09 14:25:33, 2017-10-09 14:55:33]
    

    To keep only the left edge of each interval, you can use:

    df['i'] = intervals.transform(lambda k: k.left)
    
    43851   2017-10-09 13:25:33
    43852   2017-10-09 13:55:33
    43853   2017-10-09 13:55:33
    43854   2017-10-09 13:55:33
    43855   2017-10-09 13:55:33
    43856   2017-10-09 13:55:33
    43857   2017-10-09 13:55:33
    43858   2017-10-09 14:25:33
    

    Then you can group by the interval start and the type, and use count():

    df.groupby(['i', 'type']).count()
    

    Note that because you are using 30-minute intervals, there will be many empty rows.
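
Putting the steps above together, a minimal self-contained sketch might look like the following. It rebuilds the sample frame from the question, and it departs from the answer in two ways that should be read as assumptions: it groups directly on the `pd.cut` result rather than on a left-edge column, and it uses `size()` with `observed=False` instead of `count()` so that empty intervals show up as zeros. The names `edges`, `bins` and `table` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "type": ["MEDIUM", "HIGH", "HIGH", "HIGH", "MEDIUM", "LOW", "MEDIUM", "LOW"],
        "event_date": pd.to_datetime(
            [
                "2017-10-09 13:28:33", "2017-10-09 14:19:49",
                "2017-10-09 14:23:25", "2017-10-09 14:24:18",
                "2017-10-09 14:25:31", "2017-10-09 14:25:33",
                "2017-10-09 14:25:33", "2017-10-09 14:25:38",
            ]
        ),
    },
    index=range(43851, 43859),
)

# Start 12 hours before the first LOW event.
first_low = df.loc[df["type"] == "LOW", "event_date"].iloc[0]
start = first_low - pd.Timedelta(hours=12)

# 49 edges -> 48 half-hour bins, i.e. 24 hours from the start.
edges = pd.date_range(start=start, periods=49, freq="30min")
bins = pd.cut(df["event_date"], edges)

# observed=False keeps the bins that contain no events;
# size() counts rows per (bin, type) pair.
counts = df.groupby([bins, "type"], observed=False).size()

# One row per interval, one column per type.
table = counts.unstack("type")

# Show only the intervals that actually contain events.
print(table[table.sum(axis=1) > 0])
```

Because `pd.cut` closes intervals on the right by default, the LOW event at exactly 14:25:33 lands in (13:55:33, 14:25:33]. Passing `right=False` to `pd.cut` would instead make each bin include its left edge.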