Search code examples
pythonpandaspandas-groupby

Pandas how to resample between specific dates and sum values within date ranges?


(This questions is a follow up of How to use pandas Grouper with 7d frequency and fill missing days with 0?)

I have the following dataframe

df = pd.DataFrame({
    'names': ['joe', 'joe', 'joe'],
    'dates': [dt.datetime(2019,6,1), dt.datetime(2019,6,5), dt.datetime(2019,7,1)],
    'start_date': dt.datetime(2019,5,1),
    'end_date': dt.datetime(2019,7,5),
    'values': [5,2,13]
})

  names      dates start_date   end_date  values
0   joe 2019-06-01 2019-05-01 2019-07-05       5
1   joe 2019-06-05 2019-05-01 2019-07-05       2
2   joe 2019-07-01 2019-05-01 2019-07-05      13

and I want to resample the data in 7 days blocks, with date ranges defined by start_date and end_date.

I was able to achieve the result I'm looking for by using the following workaround

temp_df = pd.DataFrame({
    'dates': [df.start_date.tolist()[0], dt.end_date.tolist()[0]],
    'names': df.names.tolist()[0],
    'values': 0
})

concat_df = pd.concat([df, temp_df], axis=0, sort=True)

concat_df.set_index('dates').groupby('names').resample('7D').sum()

which outputs

                  values
names dates             
joe   2019-05-01       0
      2019-05-08       0
      2019-05-15       0
      2019-05-22       0
      2019-05-29       5
      2019-06-05       2
      2019-06-12       0
      2019-06-19       0
      2019-06-26      13
      2019-07-03       0

which is what I'm looking for.

I'm pretty sure there are better ways to achieve that. Do you have any suggestion?

Thanks!


Solution

  • You just need to reindex you current time series, avoiding the pd.concat can yield better results.

    df.set_index('dates', inplace=True)
    def groupbyResample(groupby):
        date_range = pd.date_range(groupby.start_date.min(), groupby.end_date.max(), 
                                   freq='d', name='date_index')
    
        return date_range.to_frame(name='dates').join(groupby).resample('7D')['values'].sum()
    
    df.groupby('names').apply(groupbyResample).stack()
    

    Result:

    names  date_index
    joe    2019-05-01     0.0
           2019-05-08     0.0
           2019-05-15     0.0
           2019-05-22     0.0
           2019-05-29     5.0
           2019-06-05     2.0
           2019-06-12     0.0
           2019-06-19     0.0
           2019-06-26    13.0
           2019-07-03     0.0
    dtype: float64
    

    Using magic function %%timeit:

    %%timeit
    def groupbyResample(groupby):
    ..
    7.9 ms ± 352 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Your Approach:

    %%timeit
    temp_df = pd.DataFrame({ ..
    9.1 ms ± 394 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    There isn't much performance-enhancement, however, using a function does scale well and you won't have write a lot of code yourself. That being said, I still feel that there is a more idiomatic way to do it.