Example data:
df1 = pd.DataFrame({
'a': [1, 6, 3, 9],
'b': ['A', 'B', 'C', 'D'],
'c': [10, 20, 30, 40],
'd': [100, 200, 300, 400]
})
df2 = pd.DataFrame({
'm': [1, 5, 3, 7],
'n': [2, 6, 8, 4],
'o': [9, 10, 11, 12]
})
Requirement:
df1['a']
can occur anywhere of df2
. I want to return df2['m']
irrespective of where the match is found.
After some googling and chatGpt, I found melting df2
and merging with df1
is helpful except for it doesn't check for a match in df2['m']
.
Code:
df2_melted = df2.melt(id_vars=['m'], value_vars=['n', 'o'])
merged_df = df1.merge(df2_melted, left_on='a', right_on='value', how='left')
df1['e'] = merged_df['m']
print(df1)
Output:
a b c d e
1 A 10 100 NaN # df1['a'] == df2['m']
6 B 20 200 5.0 # df1['a'] == df2['n']
3 C 30 300 NaN # df1['a'] == df2['m']
9 D 40 400 1.0 # df1['a'] == df2['o']
Required Output:
a b c d e
1 A 10 100 1
6 B 20 200 5
3 C 30 300 3
9 D 40 400 1
If df2['m']
could also be added to value_vars
while melting, it'd have resolved the issue. I tried it, it didn't work. Then checked docs, found that whatever is there in the id_vars
, the remaining or a subset of the remaining can be part of value_vars
. So this approach might not be correct or I'm missing something.
Then I thought, if df1['a']
matches df2['m']
, then df1['e'] == df1['a'] == df2['m']
. So just replacing NaN
value with df1['a']
should work and it worked. But had to convert the column to int; because of NaN
, it's changed to float.
Working complete Code:
df2_melted = df2.melt(id_vars=['m'], value_vars=['n', 'o'])
merged_df = df1.merge(df2_melted, left_on='a', right_on='value', how='left')
df1['e'] = merged_df['m']
df1['e'] = (df1['e'].fillna(df1['a'])).astype(int)
Seemed like even though it's a working solution, it's unnecessarily complicated: "try any solution: add more code to fix the issues as you proceed without changing the initial solution".
Any other better approach which can help with my requirement?
PS1: In above example, it's not mandatory that df1 and df2 will have the same number of rows.
Just duplicate the column m
as a new column before melt
(also use map
instead of merge
):
df1['e'] = df1['a'].map(df2.eval('e=m').melt('e')
.set_index('value')['e'])
Variant with merge
:
out = df1.merge(df2.assign(e=df2['m'])
.melt('e', value_name='a')
[['a', 'e']]
)
Output:
a b c d e
0 1 A 10 100 1
1 6 B 20 200 5
2 3 C 30 300 3
3 9 D 40 400 1