Search code examples
pythonpandasdataframe

How to merge and match different length dataframes (lists) in Python/Pandas


I have over 12 dataframes that I want to merge into a single dataframe, where row values match for each column (or null if they don't exist). Each dataframe has a different number of rows, but will never repeat values. The goal is to both identify common values and missing values.

Eg.df1

id  label
1   a-1  
2   b-2  
3   z-10  

Eg.df2

id  label
1   b-2  
2   d-4  
3   e-5

Eg.df3

id  label
1   a-1
2   d-4  
3   f-6

Desired output

Eg.final

id  df1  df2  df3
1   a-1  null a-1
2   b-2  b-2  null 
3   null d-4  d-4
4   null e-5  null
5   null null f-6
6   z-10 null null

I've investigated join, but these all seem to collapse values. insert seemed plausible, but I can't rectify the different row sizes/matching values to the same row. I want to maintain each df as it's own column.


Solution

  • For multiple dataframes, you can use merge with reduce:

    from functools import reduce
    
    reduce(lambda left, right: pd.merge(left, right, on='label', how='outer'), 
           map(lambda d: d[1].drop(columns='id')
                             .assign(**{ f'df{d[0]}':lambda x: x['label'] }), 
               enumerate(dfs, 1)) 
          ).assign(id=lambda x:range(1, 1+len(x))).drop(columns='label')     # this is just to drop the existing `label` and assign new `id`
    

    Out:

        df1  df2  df3  id
    0   a-1  NaN  a-1   1
    1   b-2  b-2  NaN   2
    2   NaN  d-4  d-4   3
    3   NaN  e-5  NaN   4
    4   NaN  NaN  f-6   5
    5  z-10  NaN  NaN   6
    

    Another method is join on the index, like:

    renamed_dfs = list(map(lambda d: d[1].drop(columns='id')
                                         .assign(**{ f'df{d[0]}':lambda x: x['label'] })
                                         .set_index('label'),
                           enumerate(dfs, 1)
                       ))
    renamed_dfs[0].join(renamed_dfs[1:], how='outer').reset_index(drop=True).reset_index()
    

    Output:

       index   df1  df2  df3
    0      0   a-1  NaN  a-1
    1      1   b-2  b-2  NaN
    2      2  z-10  NaN  NaN
    3      3   NaN  d-4  d-4
    4      4   NaN  e-5  NaN
    5      5   NaN  NaN  f-6