I'm new to python and struck in a problem.
df1 = pd.DataFrame({'col1': ['apple', 'banana', 'cherry', 'apple', 'cherry']})
df2 = pd.DataFrame({'col1': ['app Banana', 'Cherry', 'banana', 'apple', 'bnapple', 'apple ch']})
Suppose I have df2 and df1 which have col1 values. df1 has correct full names but may have duplicates (which I need to keep in both dataframes due to other col data being different).
df2 has either bad names or incorrect case, which I want to correct. I want df2 column to reflect what df1 col says. the resulting column for df2 would be -
df2 = pd.DataFrame({'col1': ['banana', 'cherry', 'banana', 'apple', 'apple', 'apple']})
TIA!
I don't know how to. I tried merge query of dataframe but that just gave out everything in df1 + df2.
Craft a regex from df1
and extract
from df2
:
import re
pat = '(%s)' % '|'.join(map(re.escape, df1['col1'].unique()))
# '(apple|banana|cherry)'
df2['out'] = df2['col1'].str.extract(pat, flags=re.I, expand=False)
# variant to have lowercase
df2['out2'] = df2['col1'].str.casefold().str.extract(pat, flags=re.I, expand=False)
If you only want to match full words of df2
modify the regex to add word boundaries (\b
):
pat = r'\b(%s)\b' % '|'.join(map(re.escape, df1['col1'].unique()))
Output:
col1 out out2
0 app Banana Banana banana
1 Cherry Cherry cherry
2 banana banana banana
3 apple apple apple
4 bnapple apple apple
5 apple ch apple apple