Search code examples
python-3.xpandasmergeconcatenation

dataframes list concatenation/merging introduces nan values


I have these dataframes:

import pandas as pd
import numpy as np
from functools import reduce

a = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156], 
                  'A_val': [0.1, np.nan, 0.3, np.nan, 0.5], 'B_val': [1.233, np.nan, 1.4, np.nan, 1.9]})

b = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156], 
                  'A_val': [np.nan, 0.2, np.nan, 0.4, np.nan], 'B_val': [np.nan, 1.56, np.nan, 1.1, np.nan]})

c = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156], 
                  'C_val': [121, np.nan, 334, np.nan, 555], 'D_val': [10.233, np.nan, 10.4, np.nan, 10.9]})

d = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156], 
                  'C_val': [np.nan, 322, np.nan, 454, np.nan], 'D_val': [np.nan, 10.56, np.nan, 10.1, np.nan]})

I am dropping the nan values:

a.dropna(inplace=True)
b.dropna(inplace=True)
c.dropna(inplace=True)
d.dropna(inplace=True)

And then , I want to merge them and have this result:

id  gr_code    A_val    B_val    C_val   D_val
1    121        0.1     1.233    121.0   10.233
2    121        0.2     1.56     322     10.56
3    134        0.3     1.400    334.0   10.400
4    155        0.4     1.10     454.0   10.10
5    156        0.5     1.900    555.0   10.900

but whatever I try , it introduces nan values.

For example:

df = pd.concat([a, b, c, d], axis=1)
df = df.loc[:,~df.columns.duplicated()]

gives:

id       gr_code     A_val   B_val      C_val   D_val
1.0         121.0     0.1     1.233     121.0   10.233
3.0         134.0     0.3     1.400     334.0   10.400
5.0         156.0     0.5     1.900     555.0   10.900
NaN         NaN       NaN     NaN       NaN     NaN
NaN         NaN       NaN     NaN       NaN     NaN

If I try:

df_list = [a, b, c, d]
df = reduce(lambda left, right: pd.merge(left, right, 
                                         on=['id', 'gr_code'],
                                         how='outer'), df_list)

it gives:

id  gr_code   A_val_x   B_val_x     A_val_y     B_val_y     C_val_x     D_val_x     C_val_y     D_val_y
1    121      0.1       1.233       NaN         NaN         121.0       10.233      NaN         NaN
3    134      0.3       1.400       NaN         NaN         334.0       10.400      NaN         NaN
5    156      0.5       1.900       NaN         NaN         555.0       10.900      NaN         NaN
2    121      NaN       NaN         0.2         1.56        NaN         NaN         322.0       10.56
4    155      NaN       NaN         0.4         1.10        NaN         NaN         454.0       10.10

more dataframes:

e = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156], 
                  'E_val': [0.11, np.nan, 0.13, np.nan, 0.35], 'F_val': [11.233, np.nan, 11.4, np.nan, 11.9]})

f = pd.DataFrame({'id':[1, 2, 3, 4, 5], 'gr_code': [121, 121, 134, 155, 156], 
                  'E_val': [np.nan, 3222, np.nan, 4541, np.nan], 'F_val': [np.nan, 110.56, np.nan, 101.1, np.nan]})

Solution

  • You can use concat and merge the duplicated columns:

    df = (pd.concat([d.set_index(['id', 'gr_code']) for d in df_list], axis=1)
            .groupby(level=0, axis=1).first().reset_index()
         )
    

    output:

       id  gr_code  A_val  B_val  C_val   D_val
    0   1      121    0.1  1.233  121.0  10.233
    1   2      121    0.2  1.560  322.0  10.560
    2   3      134    0.3  1.400  334.0  10.400
    3   4      155    0.4  1.100  454.0  10.100
    4   5      156    0.5  1.900  555.0  10.900