Search code examples
pythonpandasgroup-bytime-seriespd.grouper

Grouping in minutes intervals starting at fixed time regardless the first row time and the date


I have this code intended to count occurrences in 30-minute intervals; the requirement is to have these intervals at fixed starting points, minute 00 and minute 30 of each hour. Regretfully, despite every attempt of mine, the second group is aligned to minute 03 and minute 33.

I suspect that both the groups are aligned with the first time row and that the first one is correct just by chance. How can I tell the grouper to force the alignment to minutes 00 and 30?

# load and prepare data
df_long_forecast = pd.read_csv('df_long_forecast - reduced.csv')
df_long_forecast['after_12_max_datetime'] = pd.to_datetime(
    df_long_forecast['after_12_max_datetime']
)
df_long_forecast['after_12_max_time'] = (
    df_long_forecast['after_12_max_datetime']
    - df_long_forecast['after_12_max_datetime'].dt.normalize()
)  # timedelta64[ns]

# count the number of maxes happening after 12am (absolute and percentage)
hist_max = df_long_forecast.groupby(
    pd.Grouper(key='after_12_max_time', freq='30T', offset='0T', origin='epoch')
)['Date'].count()
display(hist_max)

# count the number of maxes after 12 that result in a profit trade
# (cannot be MORE than the previous ones)
df_long_forecast_profit = df_long_forecast[
    df_long_forecast['after_12_max_>_9_to_12_high'] > 0
]
profit_long = df_long_forecast_profit.groupby(
    pd.Grouper(key='after_12_max_time', freq='30T', offset='0T', origin='epoch')
)['Date'].count()
display(profit_long)

Here is hist_max

after_12_max_time
0 days 12:00:00    24
0 days 12:30:00     5
0 days 13:00:00     7
0 days 13:30:00     5
0 days 14:00:00     5
0 days 14:30:00     4
0 days 15:00:00     4
0 days 15:30:00     1
0 days 16:00:00     5
0 days 16:30:00     7
0 days 17:00:00     1
0 days 17:30:00     6
0 days 18:00:00     1
0 days 18:30:00     1
0 days 19:00:00     1
0 days 19:30:00     6
0 days 20:00:00     3
0 days 20:30:00     0
0 days 21:00:00     6
0 days 21:30:00    19
0 days 22:00:00     8
Freq: 30T, Name: Date, dtype: int64

and this is profit_long

after_12_max_time
0 days 12:03:00     8
0 days 12:33:00     4
0 days 13:03:00     5
0 days 13:33:00     4
0 days 14:03:00     5
0 days 14:33:00     4
0 days 15:03:00     3
0 days 15:33:00     2
0 days 16:03:00     5
0 days 16:33:00     6
0 days 17:03:00     2
0 days 17:33:00     5
0 days 18:03:00     1
0 days 18:33:00     2
0 days 19:03:00     0
0 days 19:33:00     5
0 days 20:03:00     3
0 days 20:33:00     0
0 days 21:03:00     6
0 days 21:33:00    21
0 days 22:03:00     3
Freq: 30T, Name: Date, dtype: int64

The CSV file with just the pertinent columns can be downloaded from this link.


Solution

  • This seems to be a bug in Pandas, because if you convert the timedeltas to timestamps (based on, say, the first day), you get the expected minutes.

    The issue also affects resample, so I'm using it here for brevity.

    Also, offset='0T' doesn't do anything as far as I'm aware, so I'm leaving it off.

    # Base the time
    first_day = df_long_forecast['after_12_max_datetime'].min().normalize()
    df_long_forecast_profit['after_12_max_time_based'] = (
        df_long_forecast['after_12_max_time'] + first_day
    )
    
    profit_long_based = df_long_forecast_profit.resample(
        on='after_12_max_time_based',
        rule='30T',
        origin='epoch'
    )['Date'].count()
    profit_long_based
    
    after_12_max_time_based
    2020-01-02 12:00:00     8
    2020-01-02 12:30:00     3
    2020-01-02 13:00:00     6
                           ..
    2020-01-02 21:00:00     6
    2020-01-02 21:30:00    17
    2020-01-02 22:00:00     7
    Freq: 30T, Name: Date, Length: 21, dtype: int64
    

    So as a workaround, you can do this then subtract the day again.

    profit_long = profit_long_based.set_axis(
        (profit_long_based.index - first_day).rename('after_12_max_time')
    )
    profit_long
    
    after_12_max_time
    0 days 12:00:00     8
    0 days 12:30:00     3
    0 days 13:00:00     6
                       ..
    0 days 21:00:00     6
    0 days 21:30:00    17
    0 days 22:00:00     7
    Freq: 30T, Name: Date, Length: 21, dtype: int64
    

    Additional bugs?

    I played around with the resample and it looks like the closed and label parameters also have no effect on timedeltas, while they work fine for datetimes. For example, label='right':

    df_long_forecast_profit.resample(
        on='after_12_max_time',
        rule='30T',
        origin='epoch',
        label='right'
    )['Date'].count()
    
    after_12_max_time
    0 days 12:03:00     8
    0 days 12:33:00     4
    0 days 13:03:00     5
                       ..
    0 days 21:03:00     6
    0 days 21:33:00    21
    0 days 22:03:00     3
    Freq: 30T, Name: Date, Length: 21, dtype: int64
    

    vs

    df_long_forecast_profit.resample(
        on='after_12_max_time_based',
        rule='30T',
        origin='epoch',
        label='right'
    )['Date'].count()
    
    after_12_max_time_based
    2020-01-02 12:30:00     8
    2020-01-02 13:00:00     3
    2020-01-02 13:30:00     6
                           ..
    2020-01-02 21:30:00     6
    2020-01-02 22:00:00    17
    2020-01-02 22:30:00     7
    Freq: 30T, Name: Date, Length: 21, dtype: int64