Search code examples
pythonpandasdataframefillna

Add missing rows for each Client - Python / Pandas


I have df with Weeks, Months and Years.

week = ['01/03/2022 - 01/09/2022', '01/10/2022 - 01/16/2022', '01/17/2022 - 01/23/2022', '01/24/2022 - 01/30/2022']
month = ["January", "January", "January", "January"]
year = [2022, 2022, 2022, 2022]

myDict = {}

myDict["Week"] = week
myDict["Month"] = month 
myDict["Year"] = year

dates_df = pd.DataFrame(data=myDict)
dates_df
         Week              Month    Year

01/03/2022 - 01/09/2022   January   2022
01/10/2022 - 01/16/2022   January   2022
01/17/2022 - 01/23/2022   January   2022
01/24/2022 - 01/30/2022   January   2022

Data looks like that (Previously Grouped):

test_data = {'CLient Id': [1,1,1,1,2,2,2,3,3],
    'Client Name': ['Tom Holland', 'Tom Holland', 'Tom Holland', 'Tom Holland', 'Brad Pitt', 'Brad Pitt', 'Brad Pitt', 'Anna Delvey', 'Anna Delvey'],
    'City': ['New York', 'New York', 'New York', 'New York', 'Los Angeles', 'Los Angeles', 'Los Angeles', 'New York', 'New York'],    
    'Week': ['01/03/2022 - 01/09/2022', '01/10/2022 - 01/16/2022', '01/17/2022 - 01/23/2022', '01/24/2022 - 01/30/2022',
            '01/03/2022 - 01/09/2022', '01/10/2022 - 01/16/2022', '01/24/2022 - 01/30/2022', '01/03/2022 - 01/09/2022',
            '01/24/2022 - 01/30/2022'], 
    'Month': ['January', 'January', 'January', 'January', 'January', 'January', 'January', 'January', 'January'], 
    'Year': [2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022],
    'Spent': [1000, 750, 1200, 850, 777, 1500, 1200, 1500, 1200]}

test_df = pd.DataFrame(data=test_data)
test_df
Client Id    Client Name          City                Week               Month    Year   Spent

1            Tom Holland       New York      01/03/2022 - 01/09/2022    January   2022   1000
1            Tom Holland       New York      01/10/2022 - 01/16/2022    January   2022   750
1            Tom Holland       New York      01/17/2022 - 01/23/2022    January   2022   1200
1            Tom Holland       New York      01/24/2022 - 01/30/2022    January   2022   850
2            Brad Pitt         Los Angeles   01/03/2022 - 01/09/2022    January   2022   777
2            Brad Pitt         Los Angeles   01/10/2022 - 01/16/2022    January   2022   1500
2            Brad Pitt         Los Angeles   01/24/2022 - 01/30/2022    January   2022   1200
3            Anna Delvey       New York      01/03/2022 - 01/09/2022    January   2022   1500
3            Anna Delvey       New York      01/24/2022 - 01/30/2022    January   2022   1200

I need to create additional rows of Client Name, City, Week, Month and Year for each Client (Id) which missing Week, from the date df + Spent column in this case should be 0.

Output I need:

Client Id    Client Name          City                  Week               Month    Year   Spent

1            Tom Holland        New York       01/03/2022 - 01/09/2022    January   2022   1000
1            Tom Holland        New York       01/10/2022 - 01/16/2022    January   2022   750
1            Tom Holland        New York       01/17/2022 - 01/23/2022    January   2022   1200
1            Tom Holland        New York       01/24/2022 - 01/30/2022    January   2022   850
2            Brad Pitt          Los Angeles    01/03/2022 - 01/09/2022    January   2022   777
2            Brad Pitt          Los Angeles    01/10/2022 - 01/16/2022    January   2022   1500
2            Brad Pitt          Los Angeles    01/17/2022 - 01/23/2022    January   2022   0
2            Brad Pitt          Los Angeles    01/24/2022 - 01/30/2022    January   2022   1200
3            Anna Delvey        New York       01/03/2022 - 01/09/2022    January   2022   1500
3            Anna Delvey        New York       01/10/2022 - 01/16/2022    January   2022   0
3            Anna Delvey        New York       01/17/2022 - 01/23/2022    January   2022   0
3            Anna Delvey        New York       01/24/2022 - 01/30/2022    January   2022   1200

I was trying to code it, but got no luck.


Solution

  • You could pivot + reindex + fillna (to get the missing data) + stack (to get back to the previous shape):

    columns = ['Week','Month','Year']
    out = (test_df.pivot(['CLient Id', 'Client Name', 'City'], columns, ['Spent'])
           .reindex(pd.MultiIndex.from_arrays(dates_df.assign(Spent='Spent').to_numpy()[:, [-1,0,1,2]].T, 
                                              names=[None]+columns), axis=1)
           .fillna(0).stack(level=columns).reset_index())
    

    Output:

        CLient Id  Client Name         City                     Week    Month  Year   Spent  
    0           1  Tom Holland     New York  01/03/2022 - 01/09/2022  January  2022  1000.0  
    1           1  Tom Holland     New York  01/10/2022 - 01/16/2022  January  2022   750.0  
    2           1  Tom Holland     New York  01/17/2022 - 01/23/2022  January  2022  1200.0  
    3           1  Tom Holland     New York  01/24/2022 - 01/30/2022  January  2022   850.0  
    4           2    Brad Pitt  Los Angeles  01/03/2022 - 01/09/2022  January  2022   777.0  
    5           2    Brad Pitt  Los Angeles  01/10/2022 - 01/16/2022  January  2022  1500.0  
    6           2    Brad Pitt  Los Angeles  01/17/2022 - 01/23/2022  January  2022     0.0  
    7           2    Brad Pitt  Los Angeles  01/24/2022 - 01/30/2022  January  2022  1200.0  
    8           3  Anna Delvey     New York  01/03/2022 - 01/09/2022  January  2022  1500.0  
    9           3  Anna Delvey     New York  01/10/2022 - 01/16/2022  January  2022     0.0  
    10          3  Anna Delvey     New York  01/17/2022 - 01/23/2022  January  2022     0.0  
    11          3  Anna Delvey     New York  01/24/2022 - 01/30/2022  January  2022  1200.0  
    

    If you have more than one columns to fill with 0, you could use:

    columns = ['Week','Month','Year']
    value_columns = ['Spent', ...]
    new_df = (pd.DataFrame({'key':[1]*len(value_columns),'New':value_columns})
              .merge(dates_df.assign(key=1)).drop(columns='key'))
    out = (test_df.pivot(['CLient Id', 'Client Name', 'City'], columns, value_columns)
           .reindex(pd.MultiIndex.from_arrays(zip(*new_df.to_numpy()), 
                                              names=[None] + columns), axis=1)
           .fillna(0).stack(level=columns).reset_index())