I've seen plenty of examples and I can't make them work on my table. I have this table:
data = {'ID': ['Tom', 'Tom','Tom','Joseph','Joseph','Ben','Ben','Eden','Tim','Adam'], 'Tranche': ['Red', 'Red', 'Red', 'Blue','Blue','Blue','Blue','Red','Red','Blue'],'Totals':[100,100,100,50,50,90,90,70,60,70],'Sent':['2022-01-18','2022-02-19','2022-03-14','2021-04-14','2021-04-22','2022-03-03','2022-02-07','2022-01-04','2022-01-10','2022-01-15'],'Amount':[20,10,14,34,15,60,25,10,10,40],'Opened':['2021-12-29','2021-12-29','2021-12-29','2021-03-23','2021-03-23','2021-12-19','2021-12-19','2021-12-29','2021-12-29','2021-12-29']}
df = pd.DataFrame(data)
df["Opened"] = df["Opened"].astype('datetime64[ns]')
df["Sent"] = df["Sent"].astype('datetime64[ns]')
df['SentMonth'] = pd.to_datetime(df['Sent']).dt.to_period('M')
I want every ID to have every SentMonth, with amount zero if there is no amount (fillna will do if I can get to that point). I need this to make a later .cumsum() give correct results.
e.g. for Tom, the output should be something like this, but just more rows with more SentMonths. The day in the Sent column does not matter - but there must be one row for every month:
First solution that is always given is reindexing. I can't do this as every SentMonth is duplicated, and Sent will also always have duplicates in my full table.
df.resample('M').sum()
gives the error: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'.
Which I tried to fix by doing
df1 = df.set_index('SentMonth').groupby('ID').resample('1D')['Amount'].ffill()
But this brings me back to the unique index error.
Is there any other approach that can get around this? Thanks! :)
This would be a solution:
month_range = df["SentMonth"].unique()
df_per_user = dict()
for user_id, user_df in df.groupby("ID"):
user_df.set_index("SentMonth", inplace=True)
duplicated_rows = user_df.index.duplicated()
if duplicated_rows.any():
keep_rows = user_df[duplicated_rows]
user_df = user_df[~duplicated_rows]
df_per_user[user_id] = user_df.reindex(month_range).sort_index()
if duplicated_rows.any():
df_per_user[user_id] = pd.concat([df_per_user[user_id], keep_rows]).sort_index()
all_df = pd.concat(df_per_user)
all_df["Amount"] = all_df["Amount"].fillna(0)
all_df.fillna(method="bfill", axis=0, inplace=True)
all_df = all_df.reset_index().drop("level_0", axis=1)
all_df = all_df[df.columns] # restore original column order
So you have to take care for the ID with duplicated SentMonth (Joseph). The result looks like:
ID Tranche Totals Sent Amount Opened SentMonth
0 Adam Blue 70.0 2022-01-15 0.0 2021-12-29 2021-04
1 Adam Blue 70.0 2022-01-15 40.0 2021-12-29 2022-01
2 Ben Blue 90.0 2022-02-07 0.0 2021-12-19 2022-02
3 Ben Blue 90.0 2022-02-07 0.0 2021-12-19 2022-03
4 Ben Blue 90.0 2022-02-07 0.0 2021-12-19 2021-04
5 Ben Blue 90.0 2022-02-07 0.0 2021-12-19 2022-01
6 Ben Blue 90.0 2022-02-07 25.0 2021-12-19 2022-02
7 Ben Blue 90.0 2022-03-03 60.0 2021-12-19 2022-03
8 Eden Red 70.0 2022-01-04 0.0 2021-12-29 2021-04
9 Eden Red 70.0 2022-01-04 10.0 2021-12-29 2022-01
10 Joseph Blue 50.0 2021-04-14 0.0 2021-03-23 2022-02
11 Joseph Blue 50.0 2021-04-14 0.0 2021-03-23 2022-03
12 Joseph Blue 50.0 2021-04-14 34.0 2021-03-23 2021-04
13 Joseph Blue 50.0 2021-04-22 15.0 2021-03-23 2021-04
14 Tim Red 60.0 2022-01-10 0.0 2021-12-29 2022-01
15 Tim Red 60.0 2022-01-10 0.0 2021-12-29 2022-02
16 Tim Red 60.0 2022-01-10 0.0 2021-12-29 2022-03
17 Tim Red 60.0 2022-01-10 0.0 2021-12-29 2021-04
18 Tim Red 60.0 2022-01-10 10.0 2021-12-29 2022-01
19 Tom Red 100.0 2022-01-18 0.0 2021-12-29 2022-02
20 Tom Red 100.0 2022-01-18 0.0 2021-12-29 2022-03
21 Tom Red 100.0 2022-01-18 0.0 2021-12-29 2021-04
22 Tom Red 100.0 2022-01-18 20.0 2021-12-29 2022-01
23 Tom Red 100.0 2022-02-19 10.0 2021-12-29 2022-02
24 Tom Red 100.0 2022-03-14 14.0 2021-12-29 2022-03