Search code examples
pythonstringpandascontains

Replace messy str with clean str from another dataframe


I have 2 sets of dataframe, and i want to clean df1['Fruits'] if it contains df2['Fruits'] string

df1
Name    Fruits
--------------
Dina    Pineapple, [Y*]
Maria   PTC*, Apple
Johny   Durian, 1-6
Johny   5,6 Rambutan
Maria   Apple (Red), [Y] *
Dina    [Y] *, Peach88
Dina    Kiwi/Qiwi, PS*

df2
Fruits      tag
-------------
Apple       20
Pineapple   30
Rambutan    40
Durian      50
Apple (Red) 25
Peach88     55
Kiwi/Qiwi   25

i've tried

df1.loc[df1['Fruits'].contains(df2['Fruits']),'Fruits'] = df2['Fruits']

but it shows

'Series' object has no attribute 'contains'

So what i expected to get is

df1
Name    Fruits
--------------
Dina    Pineapple
Maria   Apple
Johny   Durian
Johny   Rambutan
Maria   Apple (Red)
Dina    Peach88
Dina    Kiwi/Qiwi

Solution

  • Use pandas.Series.str.extract:

    reg = '(%s)' % '|'.join(df2['Fruits'])
    # Make regex expression using df2['Fruits']
    df1['Fruits'] = df1['Fruits'].str.extract(reg)
    

    Output:

        Name     Fruits
    0   Dina  Pineapple
    1  Maria      Apple
    2  Johny     Durian
    3  Johny   Rambutan
    

    Explanation of '(%s)' % '|'.join(df2['Fruits']):

    • '|'.join(df2['Fruits']): creates | separated words for or operation in regex. Returns Pineapple|Apple|Durian|Rambutan
    • (%s) % ... : This is called string formatting, and equivalent of:
      • str.format: '({})'.format('|'.join(df2['Fruits'])),
      • or more implicit (but less pythonic) '(' + '|'.join(df2['Fruits']) + ')'
      • all of which returns (Apple|Pineapple|Rambutan|Durian), a capture group, mandatory for pd.Series.str.extract to know what to extract.