Creating sum of date ranges in Pandas


I have the following DataFrame, with over 3 million rows:

  VALID_FROM   VALID_TO  VALUE
0 2022-01-01 2022-01-02      5
1 2022-01-01 2022-01-03      2
2 2022-01-02 2022-01-04      7
3 2022-01-03 2022-01-06      3
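
For reproducibility, the sample above can be built roughly like this. It is only a sketch: the name dummy_df matches the loop further down, and the real data is assumed to have the same three columns with datetime-typed bounds.

```python
import pandas as pd

# Four-row sample mirroring the table above; the real frame has ~3 million rows.
dummy_df = pd.DataFrame({
    "VALID_FROM": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-03"]),
    "VALID_TO": pd.to_datetime(["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-06"]),
    "VALUE": [5, 2, 7, 3],
})
print(dummy_df)
```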

I want to create one large date_range with a sum of the values for each timestamp.

For the DataFrame above that would come out to:

       dates  val
0 2022-01-01    7
1 2022-01-02   14
2 2022-01-03   12
3 2022-01-04   10
4 2022-01-05    3
5 2022-01-06    3

However, since the DataFrame has a little over 3 million rows, I don't want to iterate over each row, and I'm not sure how to do this without iterating. Any suggestions?

Currently my code looks like this:

new_df = pd.DataFrame()
for idx, row in dummy_df.iterrows():
    # one daily timestamp per day in this row's validity range
    dr = pd.date_range(row["VALID_FROM"], end=row["VALID_TO"], freq="D")
    tmp_df = pd.DataFrame({"dates": dr, "val": row["VALUE"]})
    # append this row's expanded range to the running result
    new_df = pd.concat(objs=[new_df, tmp_df], ignore_index=True)

# sum the values per date
new_df.groupby("dates", as_index=False, group_keys=False).sum()

The result of the groupby would be my desired output.


Solution

  • If performance is important, repeat each row once per day in its range with Index.repeat and DataFrame.loc, build the date column by adding a per-row day counter from GroupBy.cumcount to VALID_FROM, and finally aggregate with sum:

    df['VALID_FROM'] = pd.to_datetime(df['VALID_FROM'])
    df['VALID_TO'] = pd.to_datetime(df['VALID_TO'])

    # repeat each row once per day in its range (both endpoints included)
    df1 = df.loc[df.index.repeat(df['VALID_TO'].sub(df['VALID_FROM']).dt.days + 1)]
    # add 0, 1, 2, ... days to VALID_FROM within each original row
    df1['dates'] = df1['VALID_FROM'] + pd.to_timedelta(df1.groupby(level=0).cumcount(), unit='D')

    # sum the values per date
    df1 = df1.groupby('dates', as_index=False)['VALUE'].sum()
    print(df1)
           dates  VALUE
    0 2022-01-01      7
    1 2022-01-02     14
    2 2022-01-03     12
    3 2022-01-04     10
    4 2022-01-05      3
    5 2022-01-06      3
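
The steps above can also be packaged as a self-contained sketch that runs on the sample from the question. The function name sum_over_date_ranges and the variables sample and result are made up for illustration. The sketch assumes the frame has a unique index (the per-row counter groups on index level 0) and that VALID_TO is never earlier than VALID_FROM.

```python
import pandas as pd


def sum_over_date_ranges(df: pd.DataFrame) -> pd.DataFrame:
    """Expand each [VALID_FROM, VALID_TO] range to daily rows and sum VALUE per day."""
    df = df.copy()  # leave the caller's frame untouched
    df["VALID_FROM"] = pd.to_datetime(df["VALID_FROM"])
    df["VALID_TO"] = pd.to_datetime(df["VALID_TO"])

    # Number of days each row covers, both endpoints included.
    n_days = (df["VALID_TO"] - df["VALID_FROM"]).dt.days + 1

    # Repeat each row once per covered day; copy so the new column
    # is added to an independent frame.
    expanded = df.loc[df.index.repeat(n_days)].copy()

    # cumcount() numbers the repeats of each original row 0, 1, 2, ...
    offset = expanded.groupby(level=0).cumcount()
    expanded["dates"] = expanded["VALID_FROM"] + pd.to_timedelta(offset, unit="D")

    # One row per date with the summed values.
    return expanded.groupby("dates", as_index=False)["VALUE"].sum()


# Sample data copied from the question.
sample = pd.DataFrame({
    "VALID_FROM": ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-03"],
    "VALID_TO": ["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-06"],
    "VALUE": [5, 2, 7, 3],
})
result = sum_over_date_ranges(sample)
print(result)
```

Note that the expanded frame holds one row per covered day, so its size grows with the total length of all ranges. With 3 million rows and long ranges, memory may become the limiting factor rather than speed.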