Search code examples
pythonpandasdatetimedataframedata-cleaning

Most elegant way to join two unequal data frames in python and fill-in non-existent values with 0


Say I have two data frames x_2016 and y_2017 with the columns index, 0, 1, %, and date. I am interested in the columns index and %. The problem I'm running into is that I need them to be one dataframe where the index is W11 through W15 irrespective of the year for the values in the column %, but since the indices don't overlap completely in terms of weeks, I will have some rows that need to be filled in with 0.

x_2016


    index       0       1           %           date
    2016 W12    16.0    0           2.5         2016-03-28
    2016 W13    38.0    5.0         43.0        2016-04-04
    2016 W14    48.0    13.0        63.0        2016-04-11
    2016 W15    50.0    18.0        1.0         2016-04-18

y_2017

   index         0      1           %           date
   2017 W11     16.0    8.0         40.0        2017-03-13
   2017 W12     20.0    16.0        19.0        2017-03-27
   2017 W13     34.0    27.0        6.0         2017-04-03
   2017 W14     28.0    32.0        17.0        2017-04-10

Ultimately the dataframe should look like this:

final

index     %_2016          %_2017
W11       0               40.0
W12       2.5             19.0
W13       43.0            6.0
W14       63.0            17.0       
W15       1.0             0

What's the most elegant way to do this in python?


Solution

  • You need to extract the "Wxx" information from each data frame to a new column and then merge the data frames on that column. Lastly, select just the columns of interest and sort by the W values.

    x_2016['W_index'] = x_2016['index'].str.extract('(W\d\d)', expand=True)
    
    y_2017['W_index'] = y_2017['index'].str.extract('(W\d\d)', expand=True)
    
    pd.merge(
        left=x_2016, 
        right=y_2017, 
        how='outer', 
        on='W_index', 
        suffixes=('_2016', '_2017'))[
            ['W_index', '%_2016', '%_2017']
        ].fillna(0).sort_values('W_index').reset_index(drop=True)
    
    # returns:
       W_index  %_2016  %_2017
    0      W11     0.0    40.0
    1      W12     2.5    19.0
    2      W13    43.0     6.0
    3      W14    63.0    17.0
    4      W15     1.0     0.0