I have over 12 dataframes that I want to merge into a single dataframe, where row values match for each column (or null if they don't exist). Each dataframe has a different number of rows, but will never repeat values. The goal is to both identify common values and missing values.
Eg.df1
id label
1 a-1
2 b-2
3 z-10
Eg.df2
id label
1 b-2
2 d-4
3 e-5
Eg.df3
id label
1 a-1
2 d-4
3 f-6
Desired output
Eg.final
id df1 df2 df3
1 a-1 null a-1
2 b-2 b-2 null
3 null d-4 d-4
4 null e-5 null
5 null null f-6
6 z-10 null null
I've investigated join
, but these all seem to collapse values. insert
seemed plausible, but I can't rectify the different row sizes/matching values to the same row.
I want to maintain each df
as it's own column.
For multiple dataframes, you can use merge
with reduce
:
from functools import reduce
reduce(lambda left, right: pd.merge(left, right, on='label', how='outer'),
map(lambda d: d[1].drop(columns='id')
.assign(**{ f'df{d[0]}':lambda x: x['label'] }),
enumerate(dfs, 1))
).assign(id=lambda x:range(1, 1+len(x))).drop(columns='label') # this is just to drop the existing `label` and assign new `id`
Out:
df1 df2 df3 id
0 a-1 NaN a-1 1
1 b-2 b-2 NaN 2
2 NaN d-4 d-4 3
3 NaN e-5 NaN 4
4 NaN NaN f-6 5
5 z-10 NaN NaN 6
Another method is join
on the index, like:
renamed_dfs = list(map(lambda d: d[1].drop(columns='id')
.assign(**{ f'df{d[0]}':lambda x: x['label'] })
.set_index('label'),
enumerate(dfs, 1)
))
renamed_dfs[0].join(renamed_dfs[1:], how='outer').reset_index(drop=True).reset_index()
Output:
index df1 df2 df3
0 0 a-1 NaN a-1
1 1 b-2 b-2 NaN
2 2 z-10 NaN NaN
3 3 NaN d-4 d-4
4 4 NaN e-5 NaN
5 5 NaN NaN f-6