python pandas dataframe merge multiple-columns

Python - Edit and Replace column values in a dataframe based on a column in another dataframe

I'm new to python and struck in a problem.

df1 = pd.DataFrame({'col1': ['apple', 'banana', 'cherry', 'apple', 'cherry']})
df2 = pd.DataFrame({'col1': ['app Banana', 'Cherry', 'banana', 'apple', 'bnapple', 'apple ch']})

Suppose I have df2 and df1 which have col1 values. df1 has correct full names but may have duplicates (which I need to keep in both dataframes due to other col data being different).

df2 has either bad names or incorrect case, which I want to correct. I want df2 column to reflect what df1 col says. the resulting column for df2 would be -

df2 = pd.DataFrame({'col1': ['banana', 'cherry', 'banana', 'apple', 'apple', 'apple']})

TIA!

I don't know how to. I tried merge query of dataframe but that just gave out everything in df1 + df2.

Solution

Craft a regex from df1 and extract from df2:

import re
pat = '(%s)' % '|'.join(map(re.escape, df1['col1'].unique()))
# '(apple|banana|cherry)'

df2['out'] = df2['col1'].str.extract(pat, flags=re.I, expand=False)

# variant to have lowercase
df2['out2'] = df2['col1'].str.casefold().str.extract(pat, flags=re.I, expand=False)

If you only want to match full words of df2 modify the regex to add word boundaries (\b):

pat = r'\b(%s)\b' % '|'.join(map(re.escape, df1['col1'].unique()))

Output:

         col1     out    out2
0  app Banana  Banana  banana
1      Cherry  Cherry  cherry
2      banana  banana  banana
3       apple   apple   apple
4     bnapple   apple   apple
5    apple ch   apple   apple