Search code examples
pythonpandastime-seriesresampling

Pandas time series resample, binning seems off


I was answering another question here with something about pandas I thought to know, time series resampling, when I noticed this odd binning.

Let's say I have a dataframe with a daily date range index and a column I want to resample and sum on.

index = pd.date_range(start="1/1/2018", end="31/12/2018") 
df = pd.DataFrame(np.random.randint(100, size=len(index)), 
                  columns=["sales"], index=index)

>>> df.head()
            sales
2018-01-01     66
2018-01-02     18
2018-01-03     45
2018-01-04     92
2018-01-05     76

Now I resample by one month, everything looks fine:

>>>df.resample("1M").sum()

            sales
2018-01-31   1507
2018-02-28   1186
2018-03-31   1382
[...]
2018-11-30   1342
2018-12-31   1337

If I try to resample by more months though binning starts to look off. This is particularly evident with 6M

df.resample("6M").sum()                                                           
            sales
2018-01-31   1507
2018-07-31   8393
2019-01-31   7283

First bin spans just over one month, last bin goes one month to the future. Maybe I have to set closed="left" to get the proper limits:

df.resample("6M", closed="left").sum()                                            
            sales
2018-06-30   8090
2018-12-31   9054
2019-06-30     39

Now I have an extra bin in 2019 with data from 2018-12-31...

Is this working properly? am I missing any option I should set?

EDIT: here's the output I would expect resampling one year in six month intervals, first interval spanning from Jan 1st to Jun 30, second interval spanning from Jul 1st to Dec 31.

df.resample("6M", closed="left").sum()                                            
            sales
2018-06-30   8090
2018-12-31   9093 # 9054 + 39

Note that there's also some doubt here about what it's happening with June 30 data, does it go in the first bin like I would expect or the second? I mean with the last bin it's evident but the same is probably happening in all the bins.


Solution

  • The M time offset alias implies month end frequency. What you need is 6MS which is an alias for month start frequency:

    df.resample('6MS').sum()
    

    resulting in

                sales
    2018-01-01   8130
    2018-07-01   9563
    2019-01-01      0
    

    Also df.groupby(pd.Grouper(freq='6MS')).sum() can be used interchangeably.


    For extra clarity you can compare ranges directly:

    >>> pd.date_range('2018-01-01', '2018-12-31', freq='6M')
    DatetimeIndex(['2018-01-31', '2018-07-31'], dtype='datetime64[ns]', freq='6M')
    
    >>> pd.date_range('2018-01-01', '2018-12-31', freq='6MS')
    DatetimeIndex(['2018-01-01', '2018-07-01'], dtype='datetime64[ns]', freq='6MS')