Search code examples
pythonpandasmerge

Merging multiple dataframes together with pandas


I have multiple dataframes that are structured as follows:

Molecule Name Molecular weight Score Population Score Population Error
A 100 12 15 0.2
B 205 0.4 17 0.8
C 367 17 11 0.82
D 510 9 19.6 0.1
Molecule Name Molecular weight Score Population Score Population Error
A 100 20 15 0.2
B 205 16 17 0.8
E 367 11 11 0.82
F 780 11 12 0.5

Imagine I had multiple dataframes where the molecule names, weights, and score can vary. But the population score and population error are the same across all dataframes.

I want to merge all the dataframes. The unique identifer is the Molecular weight. So if there are multiple dataframes which contain a molecular weight of 100, the scores of each get saved - take note of molecule C from the first dataframe and molecule E from the second - they have the same molecular weight so should be combined. I essentially want something that comes out like this:

Molecule Name Molecular weight Score1 Score2 Population Score Population Error
A 100 12 20 15 0.2
B 205 0.4 16 17 0.8
C 367 17 11 11 0.82
D 510 9 0 19.6 0.1
F 780 0 11 12 0.5

I have tried just simple merges but it always creates a new row for Molecule name E. I want to merge on Molecular weight. If the Molecular weight isn't in the original data frame then add a new row and populate the 0s. I hope this makes sense. Any and all help will be appreciated.


Solution

  • You could use a custom concat with de-duplication of the Score columns and groupby.first to get the first molecule name:

    dfs = [df1, df2] # could be more than 2 input DataFrames
    
    out = (pd.concat([d.set_index(['Molecular weight', 'Population Score', 'Population Error'])
                       .rename(columns={'Score': f'Score{i}'})
                      for i, d in enumerate(dfs, start=1)], axis=1)
             .groupby(level=0, axis=1).first()
             .fillna(0, downcast='infer')
             .reset_index()
          )
    

    Variant for future versions of pandas:

    dfs = [df1, df2]
    
    out = (pd.concat([d.set_index(['Molecular weight', 'Population Score', 'Population Error'])
                       .rename(columns={'Score': f'Score{i}'})
                      for i, d in enumerate(dfs, start=1)], axis=1)
             .T.groupby(level=0).first().T
             .fillna(0)
             .reset_index()
          )
    

    Output:

       Molecular weight  Population Score  Population Error Molecule Name  Score1  Score2
    0               100              15.0              0.20             A    12.0      20
    1               205              17.0              0.80             B     0.4      16
    2               367              11.0              0.82             C    17.0      11
    3               510              19.6              0.10             D     9.0       0
    4               780              12.0              0.50             F     0.0      11