Tags: python, pandas, group-by, time-series, datetimeindex

pandas bucket timestamp into TimeGrouper frequency group


I have a data frame in pandas with a datetime index. When grouping it with a time grouper, pd.Grouper(freq='360Min'), how can I join the result back onto the original timestamps? A plain equijoin on timestamp = bucket will not work, because the bucket labels differ from the raw timestamps. Is there a convenience function for this? Should an asof join be used? Or would I have to manually extract the hours and then try to match them up?

example:

for a source of

import pandas as pd
df = pd.DataFrame(
    {
        "Publish date": [
            pd.Timestamp("2000-01-02"),
            pd.Timestamp("2000-01-02"),
            pd.Timestamp("2000-01-09"),
            pd.Timestamp("2000-01-16"),
        ],
        "ID": [0, 1, 2, 3],
        "Price": [10, 20, 30, 40],
    }
)

Which gives:

  Publish date  ID  Price
0   2000-01-02   0     10
1   2000-01-02   1     20
2   2000-01-09   2     30
3   2000-01-16   3     40

I want to perform an aggregation with an arbitrary frequency (not only month, day, or hour), let's say one month:

agg_result = (
    df.groupby(pd.Grouper(key="Publish date", freq="1M"))
    .agg([pd.Series.mean, pd.Series.median])
    .reset_index()
)
agg_result.columns = ['_'.join(col).strip() for col in agg_result.columns.values]
agg_result.columns = ['Publish date month', 'ID_mean', 'ID_median', 'Price_mean', 'Price_median']
print(agg_result)
Publish date month  ID_mean  ID_median  Price_mean  Price_median
0         2000-01-31      1.5        1.5          25            25

How can I make the equijoin work here, i.e. transform each original timestamp into its bucket at the same arbitrary frequency?

Expressed in the code of the example: how can I get

df['Publish date month'] = df['Publish date'].apply(magic transform to same frequency bucket)
df.merge(agg_result, on=['Publish date month'])

to work, i.e. how do I define the transformation to the right bucket?


Solution

  • EDIT:

    The easiest way to identify the corresponding original values for each group should be:

    gb = df.groupby(pd.Grouper(key="Publish date", freq="1M"))
    # maps each bucket label to the Series of original timestamps falling into it
    dict(list(gb['Publish date']))
    

    You can then use this to join any information back to the original table.
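
    As an aside (my addition, not part of the original answer): if the goal is only to attach group statistics back to the original rows, groupby(...).transform sidesteps the join entirely and works for any Grouper frequency. A minimal sketch, reusing the df from the question:

    g = pd.Grouper(key="Publish date", freq="1M")
    # transform broadcasts the per-bucket statistic back onto every original row
    df["Price_mean"] = df.groupby(g)["Price"].transform("mean")
    df["Price_median"] = df.groupby(g)["Price"].transform("median")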


    Can you just join on two intermediate columns?

    df['Publish date'].dt.month
    

    and

    df.groupby(pd.Grouper(key="Publish date", freq="1M")).agg([pd.Series.mean, pd.Series.median]).index.month
    

    like this

    # aggregate per month bucket
    results = df.groupby(pd.Grouper(key="Publish date", freq="1M")).agg([pd.Series.mean, pd.Series.median])

    # flatten the MultiIndex columns, e.g. ('Price', 'mean') -> 'mean-Price'
    results.columns = ['-'.join(col[::-1]).strip() for col in results.columns]

    # build the same intermediate key on both sides ...
    df['month'] = df['Publish date'].dt.month
    results['month'] = results.index.month

    # ... and join on it
    results.merge(df)
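
    A frequency-agnostic sketch (again my addition, with its assumptions spelled out in the comments): instead of joining on the month number, recreate the Grouper's own bin label on each original row and equijoin on that, which also covers intraday frequencies such as "360Min".

    # "1M" bins are labelled with the month end; MonthEnd(0) rolls each timestamp
    # forward to that label (this assumes the timestamps sit at midnight, as in
    # the example data).
    df["Publish date month"] = df["Publish date"] + pd.offsets.MonthEnd(0)
    merged = df.merge(agg_result, on="Publish date month")

    # For an intraday frequency such as "360Min", flooring reproduces the default
    # left-edge bin labels of pd.Grouper:
    # df["bucket"] = df["Publish date"].dt.floor("360min")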