I have the following challenge: I have two Pandas Dataframes with information about eg. chemical substances and related to some additional information. E.g. production region or country. For example, like this:
data1 = {
'Substance' : ['Substance1', 'Substance2', 'Substance1', 'Substance3', 'Substance2', 'Substance1', 'Substance3', 'Substance2', 'Substance4'],
'Name' : ['Bayer', 'Sanofi', 'Pfizer', 'AstraZeneca', 'BionTech', 'Sanofi', 'Sanofi', 'Bayer', 'Pfizer'],
'Region' : ['Europe', 'Europe', 'Asia', 'Asia', 'North America', np.nan, np.nan, 'Europe', 'Asia'],
'Country' : ['France', np.nan, np.nan, 'P.R. China', 'United States', np.nan, np.nan, np.nan, 'India'],
'Misc' : [910, 200, 898, 910, 910, 12, 34, 700, 800],
}
df1 = pd.DataFrame(data1)
#print(df1)
and
data2 = {
'Substance' : ['Substance1', 'Substance2', 'Substance1', 'Substance3', 'Substance2'],
'Name' : ['Bayer', 'Sanofi', 'Pfizer', 'AstraZeneca', 'BionTech'],
'Region' : ['Europe', 'Europe', 'Asia', 'Asia', 'North America'],
'Country' : ['France', 'Germany', 'India', 'P.R. China', 'United States'],
'Misc' : [910, 200, 898, 910, 910],
}
df2 = pd.DataFrame(data2)
#print(df2)
In the first data frame all orders from one year are listed. It happens that a substance is also supplied by several manufacturers and that the row appears more than once (== number of orders). I would like to supplement the dataframe (df1) with information from certain columns of another dataframe (df2). The goal is, if in df1 for a substance and a manufacturer the value in the column 'Region' or 'Country' is empty, then please insert the value from the corresponding column from df2.
df3 = df1.loc[df1.Substance.isin(df2.Substance), ['Substance', 'Region', 'Country']] = df2[['Substance', 'Region', 'Country']]
#print(df3)
However, I then lose many rows (e.g. if substances are repeated). But I would keep all rows in df, just supplemented with information from df2.
I would expect the following result for df3:
| | Substance | Name | Region | Country | Misc |
-------------------------------------------------------------------
| 0 | Substance1 | Bayer | Europe | France | 910 |
| 1 | Substance2 | Sanofi | Europe | Germany | 200 |
| 2 | Substance1 | Pfizer | Asia | India | 898 |
| 3 | Substance3 | AstraZeneca | Asia | P.R. China | 910 |
| 4 | Substance2 | BionTech |North America | United States | 910 |
| 5 | Substance1 | Sanofi | Europe | France | 12 |
| 6 | Substance3 | Sanfoi | np.nan | np.nan | 34 |
| 7 | Substance2 | Bayer | Europe | Germany | 700 |
| 8 | Substance4 | Pfizer | Asia | India | 800 |
Any suggestions would be appreciated. Many thanks in advance!
EDIT:
I'm not sure if this is a new question or an edit of my old question - that's why I'm posting this as a comment and want to ask for your help again, especially @mozway.
data1 = {
'Substance' : ['Substance1', 'Substance2', 'Substance1', 'Substance3', 'Substance2', 'Substance1', 'Substance3', 'Substance2', 'Substance4', 'Product A'],
'Name' : ['Bayer', 'Sanofi', 'Pfizer', 'AstraZeneca', 'BionTech', 'Sanofi', 'Sanofi', 'Bayer', 'Pfizer', 'Braun'],
'Region' : ['Europe', 'Europe', 'Asia', 'Asia', 'North America', np.nan, np.nan, 'Europe', 'Asia', 'Asia'],
'Country' : ['France', np.nan, np.nan, 'P.R. China', 'United States', np.nan, np.nan, np.nan, 'India', 'Indonesia'],
'Misc' : [910, 200, 898, 910, 910, 12, 34, 700, 800, 10000],
}
df1 = pd.DataFrame(data1)
#print('line ', lineno(), '\n ', df1)
data2 = {
'Substance' : ['Substance1', 'Substance1', 'Substance1', 'Substance3', 'Substance2', 'Substance3'],
'Name' : ['Bayer', 'Sanofi', 'Pfizer', 'AstraZeneca', 'BionTech', 'Sanofi'],
'Region' : ['Europe', 'Europe', 'Asia', 'Asia', 'North America', np.nan],
'Country' : ['France', 'Germany', 'India', 'P.R. China', 'United States', 'Poland'],
'Misc' : [910, 200, 898, 910, 910,40],
}
df2 = pd.DataFrame(data2)
#print('line ', lineno(), '\n ', df2)
Using 'merge' as suggested by @mozway:
out = (df1[['Substance',
'Name',
# 'Region',
#'Country',
]]
.merge(df2#.drop(columns='Substance')
.groupby(['Substance',
'Name'
], as_index=False).first(),
how='left')
)
#print('line ', lineno(), '\n ',out)
It results in
Substance Name Region Country Misc
0 Substance1 Bayer Europe France 910.0
1 Substance2 Sanofi NaN NaN NaN
2 Substance1 Pfizer Asia India 898.0
3 Substance3 AstraZeneca Asia P.R. China 910.0
4 Substance2 BionTech North America United States 910.0
5 Substance1 Sanofi Europe Germany 200.0
6 Substance3 Sanofi NaN NaN NaN
7 Substance2 Bayer NaN NaN NaN
8 Substance4 Pfizer NaN NaN NaN
But I want the columns 'Region' and 'Country' to be completed by the values of the columns 'Region' and 'Country'. In the example of line 5 in the out-dataframe the value of Substance 1, distributed by Sanofi supplemented from line 1 in df2. That's great. But, the information from df1 in this certain columns is not considered. See, for instance, the line for Product A, where the values of column 'Region' and 'Country' are NaNs. Or line 2 of the out-dataframe where Substance 2 by Sanofihas NaNs in the Columns 'Region' and 'Country', even if the information for 'Region' is given in df1 (line 1).
I am looking for an approach, which keeps the information from df1 (if any exists --> not NaN). If not, I would like to add the information from df2. I really appreciate your support, many thanks in advance!
You can use a custom merge
:
out = (df1[['Substance', 'Name']]
.merge(df2.drop(columns='Substance')
.groupby('Name', as_index=False).first(),
how='left')
)
Output:
Substance Name Region Country Misc
0 Substance1 Bayer Europe France 910
1 Substance2 Sanofi Europe Germany 200
2 Substance1 Pfizer Asia India 898
3 Substance3 AstraZeneca Asia P.R. China 910
4 Substance2 BionTech North America United States 910
5 Substance1 Sanofi Europe Germany 200
6 Substance3 Sanofi Europe Germany 200
7 Substance2 Bayer Europe France 910
8 Substance4 Pfizer Asia India 898