I have many dataframe like:
df1 df2 and so on...
gene | counts gene | counts
KRAS 136 KRAS 96
DNAH5 3 DNAH5 4
TP53 105 TP53 20
I want to merge them and sum the column 'counts' so I end with only one dataframe
merged_df
gene | counts
KRAS 232
DNAH5 7
TP53 125
I have tried to use pd.merge but it only accepts 2 dataframes at once, I have 14 dataframes. I used pd.concat for multiple dataframes but can't sum them after.
Indeed, pd.merge
only merges 2 dataframes. But pd.join
can join many, if they have the same index:
# Some example data. Note the None in `df3`. We want our code to handle that well.
df1 = pd.DataFrame({'gene': ['KRAS', 'DNAH5', 'TP53'], 'counts': [136, 3, 105]})
df2 = pd.DataFrame({'gene': ['KRAS', 'DNAH5', 'TP53'], 'counts': [96, 4, 20]})
df3 = pd.DataFrame({'gene': ['KRAS', 'DNAH5', 'TP53'], 'counts': [1000, None, 3000]})
dfs = [df1, df2, df3]
# We need the same index for the pd.DataFrame.join to work
dfs = [df.set_index('gene') for df in dfs]
# All the non-index columns need unique names, so chanding `columns` to `columns_0`, `columns_1`
dfs = [df.rename(columns={'counts': f'counts_{i}'}) for i, df in enumerate(dfs)]
# Actual join. We are joining first with the rest
df = dfs[0].join(dfs[1:], how='outer')
# Since we don't have any other data, we can just sum all columns.
df.sum(1)
This prints:
gene
KRAS 1232.0
DNAH5 7.0
TP53 3125.0