Search code examples
pythonpandasdataframesum

How do I merge multiple dataframes and sum common values into column


I have many dataframe like:

df1                  df2                  and so on...
gene | counts        gene | counts   
KRAS    136          KRAS   96
DNAH5    3           DNAH5   4
TP53    105          TP53   20

I want to merge them and sum the column 'counts' so I end with only one dataframe

merged_df
gene | counts
KRAS    232       
DNAH5    7    
TP53    125 

I have tried to use pd.merge but it only accepts 2 dataframes at once, I have 14 dataframes. I used pd.concat for multiple dataframes but can't sum them after.


Solution

  • Indeed, pd.merge only merges 2 dataframes. But pd.join can join many, if they have the same index:

    # Some example data. Note the None in `df3`. We want our code to handle that well.
    df1 = pd.DataFrame({'gene': ['KRAS', 'DNAH5', 'TP53'], 'counts': [136, 3, 105]})
    df2 = pd.DataFrame({'gene': ['KRAS', 'DNAH5', 'TP53'], 'counts': [96, 4, 20]})
    df3 = pd.DataFrame({'gene': ['KRAS', 'DNAH5', 'TP53'], 'counts': [1000, None, 3000]})
    dfs = [df1, df2, df3]
    
    # We need the same index for the pd.DataFrame.join to work
    dfs = [df.set_index('gene') for df in dfs]
    
    # All the non-index columns need unique names, so chanding `columns` to `columns_0`, `columns_1`
    dfs = [df.rename(columns={'counts': f'counts_{i}'}) for i, df in enumerate(dfs)]
    
    # Actual join. We are joining first with the rest
    df = dfs[0].join(dfs[1:], how='outer')
    
    # Since we don't have any other data, we can just sum all columns.
    df.sum(1)
    

    This prints:

    gene
    KRAS     1232.0
    DNAH5       7.0
    TP53     3125.0