I am attempting to merge many columns with the same name from two dfs into a df with one combined column. The result is creating new columns Example_column_x and Example_column_y. I believe this is from the NaN value? I am wanting to keep all of the information (excluding NaN) from both dataframes combined into only 1 column; without the appended _x or _y. I would like to avoid having to manipulate the dataframe after merging the dataframes as there are many columns that this is happening to. Thank you
import pandas as pd
import numpy as np
df= [['person one', 10], ['person two', np.nan]]
df= pd.DataFrame(df, columns=['Name', 'Example_column'])
df:
Name Example_column
person one 10.0
person two NaN
df2= [['person one', np.nan], ['person two', 'excused']]
df2= pd.DataFrame(df2, columns=['Name', 'Example_column'])
df2:
Name Example_column
person one NaN
person two excused
Resulting df_final that I do not want= pd.merge(df, df2, on=['Name'])
print(df_final)
resulting df_final:
Name Example_column_x Example_column_y
person one 10.0 NaN
person two NaN excused
df_final I want:
Name Example_column
person one 10.0
person two excused
If you want to get the first non-NaN (if any), then concat
and keep the first non-NaN per group:
out = pd.concat([df, df2]).groupby('Name', as_index=False, sort=False).first()
Output:
Name Example_column
0 person one 10.0
1 person two excused
Alternatively, assuming a more complex logic: keep all values but NaNs, except if there are only NaNs, then just keep one.
df= [['person one', 10], ['person two', np.nan], ['person three', 30]]
df= pd.DataFrame(df, columns=['Name', 'Example_column'])
df2= [['person one', np.nan], ['person two', np.nan], ['person three', 'excused']]
df2= pd.DataFrame(df2, columns=['Name', 'Example_column'])
tmp = pd.concat([df, df2]).sort_values(by='Example_column', key=lambda x: x.notna())
out = tmp[tmp['Name'].duplicated()|tmp['Example_column'].notna()
Output:
Name Example_column
1 person two NaN
0 person one 10.0
2 person three 30.0
2 person three excused