Search code examples
pythonpandasdataframedatetimeresampling

Add missing months to dataframe


I've seen plenty of examples and I can't make them work on my table. I have this table:

data = {'ID': ['Tom', 'Tom','Tom','Joseph','Joseph','Ben','Ben','Eden','Tim','Adam'], 'Tranche': ['Red', 'Red', 'Red', 'Blue','Blue','Blue','Blue','Red','Red','Blue'],'Totals':[100,100,100,50,50,90,90,70,60,70],'Sent':['2022-01-18','2022-02-19','2022-03-14','2021-04-14','2021-04-22','2022-03-03','2022-02-07','2022-01-04','2022-01-10','2022-01-15'],'Amount':[20,10,14,34,15,60,25,10,10,40],'Opened':['2021-12-29','2021-12-29','2021-12-29','2021-03-23','2021-03-23','2021-12-19','2021-12-19','2021-12-29','2021-12-29','2021-12-29']}  
df = pd.DataFrame(data)  
df["Opened"] = df["Opened"].astype('datetime64[ns]')
df["Sent"] = df["Sent"].astype('datetime64[ns]')
df['SentMonth'] = pd.to_datetime(df['Sent']).dt.to_period('M')

enter image description here

I want every ID to have every SentMonth, with amount zero if there is no amount (fillna will do if I can get to that point). I need this to make a later .cumsum() give correct results.

e.g. for Tom, the output should be something like this, but just more rows with more SentMonths. The day in the Sent column does not matter - but there must be one row for every month: enter image description here

First solution that is always given is reindexing. I can't do this as every SentMonth is duplicated, and Sent will also always have duplicates in my full table.

df.resample('M').sum() gives the error: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'. Which I tried to fix by doing

df1 = df.set_index('SentMonth').groupby('ID').resample('1D')['Amount'].ffill()

But this brings me back to the unique index error.

Is there any other approach that can get around this? Thanks! :)


Solution

  • This would be a solution:

    month_range = df["SentMonth"].unique()
    df_per_user = dict()
    for user_id, user_df in df.groupby("ID"):
        user_df.set_index("SentMonth", inplace=True)
        duplicated_rows = user_df.index.duplicated()
        if duplicated_rows.any():
            keep_rows = user_df[duplicated_rows]
            user_df = user_df[~duplicated_rows]
    
        df_per_user[user_id] = user_df.reindex(month_range).sort_index()
    
        if duplicated_rows.any():
            df_per_user[user_id] = pd.concat([df_per_user[user_id], keep_rows]).sort_index()
    
    all_df = pd.concat(df_per_user)
    all_df["Amount"] = all_df["Amount"].fillna(0)
    all_df.fillna(method="bfill", axis=0, inplace=True)
    all_df = all_df.reset_index().drop("level_0", axis=1)
    all_df = all_df[df.columns] # restore original column order
    

    So you have to take care for the ID with duplicated SentMonth (Joseph). The result looks like:

            ID Tranche  Totals       Sent  Amount     Opened SentMonth
    0     Adam    Blue    70.0 2022-01-15     0.0 2021-12-29   2021-04
    1     Adam    Blue    70.0 2022-01-15    40.0 2021-12-29   2022-01
    2      Ben    Blue    90.0 2022-02-07     0.0 2021-12-19   2022-02
    3      Ben    Blue    90.0 2022-02-07     0.0 2021-12-19   2022-03
    4      Ben    Blue    90.0 2022-02-07     0.0 2021-12-19   2021-04
    5      Ben    Blue    90.0 2022-02-07     0.0 2021-12-19   2022-01
    6      Ben    Blue    90.0 2022-02-07    25.0 2021-12-19   2022-02
    7      Ben    Blue    90.0 2022-03-03    60.0 2021-12-19   2022-03
    8     Eden     Red    70.0 2022-01-04     0.0 2021-12-29   2021-04
    9     Eden     Red    70.0 2022-01-04    10.0 2021-12-29   2022-01
    10  Joseph    Blue    50.0 2021-04-14     0.0 2021-03-23   2022-02
    11  Joseph    Blue    50.0 2021-04-14     0.0 2021-03-23   2022-03
    12  Joseph    Blue    50.0 2021-04-14    34.0 2021-03-23   2021-04
    13  Joseph    Blue    50.0 2021-04-22    15.0 2021-03-23   2021-04
    14     Tim     Red    60.0 2022-01-10     0.0 2021-12-29   2022-01
    15     Tim     Red    60.0 2022-01-10     0.0 2021-12-29   2022-02
    16     Tim     Red    60.0 2022-01-10     0.0 2021-12-29   2022-03
    17     Tim     Red    60.0 2022-01-10     0.0 2021-12-29   2021-04
    18     Tim     Red    60.0 2022-01-10    10.0 2021-12-29   2022-01
    19     Tom     Red   100.0 2022-01-18     0.0 2021-12-29   2022-02
    20     Tom     Red   100.0 2022-01-18     0.0 2021-12-29   2022-03
    21     Tom     Red   100.0 2022-01-18     0.0 2021-12-29   2021-04
    22     Tom     Red   100.0 2022-01-18    20.0 2021-12-29   2022-01
    23     Tom     Red   100.0 2022-02-19    10.0 2021-12-29   2022-02
    24     Tom     Red   100.0 2022-03-14    14.0 2021-12-29   2022-03