Search code examples
pythonpandasjoinmerge

Pandas Merge - How can I merge two dataframes and keep all of the values without creating multiple new columns (_x _y)?


I am attempting to merge many columns with the same name from two dfs into a df with one combined column. The result is creating new columns Example_column_x and Example_column_y. I believe this is from the NaN value? I am wanting to keep all of the information (excluding NaN) from both dataframes combined into only 1 column; without the appended _x or _y. I would like to avoid having to manipulate the dataframe after merging the dataframes as there are many columns that this is happening to. Thank you

import pandas as pd
import numpy as np


df= [['person one', 10], ['person two', np.nan]]
df= pd.DataFrame(df, columns=['Name', 'Example_column'])
df:
      Name  Example_column
person one            10.0
person two             NaN

df2= [['person one', np.nan], ['person two', 'excused']]
df2= pd.DataFrame(df2, columns=['Name', 'Example_column'])
df2: 
      Name Example_column
person one            NaN
person two        excused


Resulting df_final that I do not want= pd.merge(df, df2, on=['Name'])
print(df_final)
resulting df_final:
      Name  Example_column_x Example_column_y
person one              10.0              NaN
person two               NaN          excused


df_final I want:
      Name    Example_column 
person one              10.0              
person two           excused

Solution

  • If you want to get the first non-NaN (if any), then concat and keep the first non-NaN per group:

    out = pd.concat([df, df2]).groupby('Name', as_index=False, sort=False).first()
    

    Output:

             Name Example_column
    0  person one           10.0
    1  person two        excused
    

    Alternatively, assuming a more complex logic: keep all values but NaNs, except if there are only NaNs, then just keep one.

    df= [['person one', 10], ['person two', np.nan], ['person three', 30]]
    df= pd.DataFrame(df, columns=['Name', 'Example_column'])
    
    df2= [['person one', np.nan], ['person two', np.nan], ['person three', 'excused']]
    df2= pd.DataFrame(df2, columns=['Name', 'Example_column'])
    
    tmp = pd.concat([df, df2]).sort_values(by='Example_column', key=lambda x: x.notna())
    out = tmp[tmp['Name'].duplicated()|tmp['Example_column'].notna()
    

    Output:

               Name Example_column
    1    person two            NaN
    0    person one           10.0
    2  person three           30.0
    2  person three        excused