Search code examples
pythonpandasdataframe

pandas search across multiple columns return one column if matches


Example data:

df1 = pd.DataFrame({
    'a': [1, 6, 3, 9],
    'b': ['A', 'B', 'C', 'D'],
    'c': [10, 20, 30, 40],
    'd': [100, 200, 300, 400]
})

df2 = pd.DataFrame({
    'm': [1, 5, 3, 7],
    'n': [2, 6, 8, 4],
    'o': [9, 10, 11, 12]
})

Requirement:
df1['a'] can occur anywhere of df2. I want to return df2['m'] irrespective of where the match is found.

After some googling and chatGpt, I found melting df2 and merging with df1 is helpful except for it doesn't check for a match in df2['m'].

Code:

df2_melted = df2.melt(id_vars=['m'], value_vars=['n', 'o'])
merged_df = df1.merge(df2_melted, left_on='a', right_on='value', how='left')
df1['e'] = merged_df['m']
print(df1)

Output:

a  b   c    d    e
1  A  10  100  NaN     # df1['a'] == df2['m']
6  B  20  200  5.0     # df1['a'] == df2['n']
3  C  30  300  NaN     # df1['a'] == df2['m']
9  D  40  400  1.0     # df1['a'] == df2['o']

Required Output:

a  b   c    d  e
1  A  10  100  1
6  B  20  200  5
3  C  30  300  3
9  D  40  400  1

If df2['m'] could also be added to value_vars while melting, it'd have resolved the issue. I tried it, it didn't work. Then checked docs, found that whatever is there in the id_vars, the remaining or a subset of the remaining can be part of value_vars. So this approach might not be correct or I'm missing something.

Then I thought, if df1['a'] matches df2['m'], then df1['e'] == df1['a'] == df2['m']. So just replacing NaN value with df1['a'] should work and it worked. But had to convert the column to int; because of NaN, it's changed to float.

Working complete Code:

df2_melted = df2.melt(id_vars=['m'], value_vars=['n', 'o'])
merged_df = df1.merge(df2_melted, left_on='a', right_on='value', how='left')
df1['e'] = merged_df['m']
df1['e'] = (df1['e'].fillna(df1['a'])).astype(int)

Seemed like even though it's a working solution, it's unnecessarily complicated: "try any solution: add more code to fix the issues as you proceed without changing the initial solution".

Any other better approach which can help with my requirement?


PS1: In above example, it's not mandatory that df1 and df2 will have the same number of rows.


Solution

  • Just duplicate the column m as a new column before melt (also use map instead of merge):

    df1['e'] = df1['a'].map(df2.eval('e=m').melt('e')
                               .set_index('value')['e'])
    

    Variant with merge:

    out = df1.merge(df2.assign(e=df2['m'])
                       .melt('e', value_name='a')
                       [['a', 'e']]
                   )
    

    Output:

       a  b   c    d  e
    0  1  A  10  100  1
    1  6  B  20  200  5
    2  3  C  30  300  3
    3  9  D  40  400  1