I have multiple dataframes that are structured as follows:
Molecule Name | Molecular weight | Score | Population Score | Population Error |
---|---|---|---|---|
A | 100 | 12 | 15 | 0.2 |
B | 205 | 0.4 | 17 | 0.8 |
C | 367 | 17 | 11 | 0.82 |
D | 510 | 9 | 19.6 | 0.1 |
Molecule Name | Molecular weight | Score | Population Score | Population Error |
---|---|---|---|---|
A | 100 | 20 | 15 | 0.2 |
B | 205 | 16 | 17 | 0.8 |
E | 367 | 11 | 11 | 0.82 |
F | 780 | 11 | 12 | 0.5 |
Imagine I had multiple dataframes where the molecule names, weights, and score can vary. But the population score and population error are the same across all dataframes.
I want to merge all the dataframes. The unique identifer is the Molecular weight. So if there are multiple dataframes which contain a molecular weight of 100, the scores of each get saved - take note of molecule C from the first dataframe and molecule E from the second - they have the same molecular weight so should be combined. I essentially want something that comes out like this:
Molecule Name | Molecular weight | Score1 | Score2 | Population Score | Population Error |
---|---|---|---|---|---|
A | 100 | 12 | 20 | 15 | 0.2 |
B | 205 | 0.4 | 16 | 17 | 0.8 |
C | 367 | 17 | 11 | 11 | 0.82 |
D | 510 | 9 | 0 | 19.6 | 0.1 |
F | 780 | 0 | 11 | 12 | 0.5 |
I have tried just simple merges but it always creates a new row for Molecule name E. I want to merge on Molecular weight. If the Molecular weight isn't in the original data frame then add a new row and populate the 0s. I hope this makes sense. Any and all help will be appreciated.
You could use a custom concat
with de-duplication of the Score
columns and groupby.first
to get the first molecule name:
dfs = [df1, df2] # could be more than 2 input DataFrames
out = (pd.concat([d.set_index(['Molecular weight', 'Population Score', 'Population Error'])
.rename(columns={'Score': f'Score{i}'})
for i, d in enumerate(dfs, start=1)], axis=1)
.groupby(level=0, axis=1).first()
.fillna(0, downcast='infer')
.reset_index()
)
Variant for future versions of pandas:
dfs = [df1, df2]
out = (pd.concat([d.set_index(['Molecular weight', 'Population Score', 'Population Error'])
.rename(columns={'Score': f'Score{i}'})
for i, d in enumerate(dfs, start=1)], axis=1)
.T.groupby(level=0).first().T
.fillna(0)
.reset_index()
)
Output:
Molecular weight Population Score Population Error Molecule Name Score1 Score2
0 100 15.0 0.20 A 12.0 20
1 205 17.0 0.80 B 0.4 16
2 367 11.0 0.82 C 17.0 11
3 510 19.6 0.10 D 9.0 0
4 780 12.0 0.50 F 0.0 11