Search code examples
pythonpandasdataframefrequencyreindex

Intuition behind resample function in pandas


Problem: I have a data set of daily gold prices. I want to take the mean in each month and done this by resampling the data. Next, I want to change the date of each month to end of each business month, I have done this by resampling the new mean data and resampled again. The code is as below.

gold_df = gold_df.resample(rule='M').mean()
gold_df = gold_df.resample(rule='BM').last()

>>> gold_df
Date    
1979-01-31  227.215217
1979-02-28  245.670000
1979-03-30  NaN
1979-04-30  238.664286
1979-05-31  257.782609

Issue: However, I get NaN in some of the months. Shouldn't resampling method just change the dates in this case, as there is only one value in each month (due to the first resample step). I solved this by doing the below, but am confused in understanding the intuition behind resample() and why I need to first resample in 'BMS' frequency and then 'BM' again.

gold_df = gold_df.resample(rule='M').mean()
gold_df = gold_df.resample(rule='BMS').last()
gold_df = gold_df.resample(rule='BM').last()

>>> gold_df
Date    
1978-12-29  226.000000
1979-01-31  227.215217
1979-02-28  245.670000
1979-03-30  242.047727
1979-04-30  238.664286

Solution

  • This has nothing to do with missing values in your data. NaN occurs here when the last day of the month is not a business day.

    Look at the following example:

    idx = pd.date_range('2020-01-01','2020-12-31')
    s = pd.Series(pd.np.random.randint(20, 30, len(idx)), index=idx)
    s.resample('M').mean().resample('BM').last().to_frame().assign(day=s.resample('M').mean().index.day_name())
    #                    0        day
    #2020-01-31  24.612903     Friday
    #2020-02-28        NaN   Saturday
    #2020-03-31  24.096774    Tuesday
    #2020-04-30  24.433333   Thursday
    #2020-05-29        NaN     Sunday
    #2020-06-30  23.800000    Tuesday
    #2020-07-31  23.677419     Friday
    #2020-08-31  23.870968     Monday
    #2020-09-30  25.333333  Wednesday
    #2020-10-30        NaN   Saturday
    #2020-11-30  24.266667     Monday
    #2020-12-31  23.806452   Thursday
    

    You can circumvent this by setting label and closed to left for 'BM' (default is right):

    s.resample('M').mean().resample('BM', label='left', closed='left').last()
    #2020-01-31    24.612903
    #2020-02-28    23.896552
    #2020-03-31    24.096774
    #2020-04-30    24.433333
    #2020-05-29    24.064516
    #2020-06-30    23.800000
    #2020-07-31    23.677419
    #2020-08-31    23.870968
    #2020-09-30    25.333333
    #2020-10-30    24.838710
    #2020-11-30    24.266667
    #2020-12-31    23.806452
    #Freq: BM, dtype: float64
    assert all(s.resample('M').mean().resample('BM', label='left', closed='left').last() == s.resample('M').mean().resample('BMS').last().resample('BM').last())