I have the following DataFrame, with over 3 million rows:
  VALID_FROM   VALID_TO  VALUE
0 2022-01-01 2022-01-02      5
1 2022-01-01 2022-01-03      2
2 2022-01-02 2022-01-04      7
3 2022-01-03 2022-01-06      3
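For reproducibility, the frame above can be built like this (dates as plain strings; the name dummy_df matches my code further down):

import pandas as pd

dummy_df = pd.DataFrame(
    {
        "VALID_FROM": ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-03"],
        "VALID_TO": ["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-06"],
        "VALUE": [5, 2, 7, 3],
    }
)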
I want to create one large date_range with a sum of the values for each timestamp.
For the DataFrame above that would come out to:
       dates  val
0 2022-01-01    7
1 2022-01-02   14
2 2022-01-03   12
3 2022-01-04   10
4 2022-01-05    3
5 2022-01-06    3
However, since the DataFrame has a little over 3 million rows, I don't want to iterate over each row, and I'm not sure how to do this without iterating. Any suggestions?
Currently my code looks like this:
new_df = pd.DataFrame()
for idx, row in dummy_df.iterrows():
    # One row per day covered by this interval (both endpoints inclusive)
    dr = pd.date_range(row["VALID_FROM"], end=row["VALID_TO"], freq="D")
    tmp_df = pd.DataFrame({"dates": dr, "val": row["VALUE"]})
    new_df = pd.concat(objs=[new_df, tmp_df], ignore_index=True)

new_df = new_df.groupby("dates", as_index=False)["val"].sum()
The result of the groupby would be my desired output.
If performance is important, use Index.repeat with DataFrame.loc to create the new rows, build the dates column from a per-row counter via GroupBy.cumcount, and finally aggregate with sum:
df['VALID_FROM'] = pd.to_datetime(df['VALID_FROM'])
df['VALID_TO'] = pd.to_datetime(df['VALID_TO'])

# Repeat each row once per day in its [VALID_FROM, VALID_TO] range (inclusive)
df1 = df.loc[df.index.repeat(df['VALID_TO'].sub(df['VALID_FROM']).dt.days + 1)]

# Counter within each original row (0, 1, 2, ...) converted to a day offset
df1['dates'] = df1['VALID_FROM'] + pd.to_timedelta(df1.groupby(level=0).cumcount(), unit='d')

df1 = df1.groupby('dates', as_index=False)['VALUE'].sum()
print(df1)
       dates  VALUE
0 2022-01-01      7
1 2022-01-02     14
2 2022-01-03     12
3 2022-01-04     10
4 2022-01-05      3
5 2022-01-06      3
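The speedup over the loop comes from replacing millions of single-row concats with one vectorized repeat plus one timedelta addition. As a quick sanity check, the output matches the loop-based result from the question on the sample data; a minimal sketch, assuming df started as dummy_df.copy() and new_df holds the aggregated loop output:

from pandas.testing import assert_frame_equal

# The loop named its value column 'val'; the vectorized version keeps 'VALUE'
assert_frame_equal(new_df.rename(columns={'val': 'VALUE'}), df1)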