Tags: pandas, nlp, lemmatization

Lemmatization or normalization using a dictionary and a list of variations


I have a pandas DataFrame with a string column that holds transaction descriptions. I am trying to do some manual lemmatization. I have manually created a dictionary that has the main word as the key and a list of variations of that word as the value. I would like to replace every variation in the list with the main word.

Here is example code for the data I have.

import pandas as pd
list1 = ['0412 UBER TRIP HELP.UBER.COMCA',
'0410 UBER TRIP HELP.UBER.COMCA',
'MOBILE PURCHASE 0410 VALENCIA WHOLE FOODS SAN FRANCISCOCA',
'WHOLEFDS WBG#1 04/13 PURCHASE WHOLEFDS WBG#104 BROOKLYN NY',
'0414 LYFT *CITI BIKE BIK LYFT.COM CA',
'0421 WALGREENS.COM 877-250-5823 IL',
'0421 Rapha Racing PMT LLC XXX-XX72742 OR',
'0422 UBER EATS PAYMENT HELP.UBER.COMCA',
'0912 WHOLEFDS NOE 10379 SAN FRANCISCOCA',
'PURCHASE 1003 CAVIAR*JUNOON WWW.DOORDASH.CA']
df = pd.DataFrame(list1, columns = ['feature'])

map1 = {'payment':['pmts','pmnt','pmt','pmts','pyment','pymnts'],
'account':['acct'],
 'pharmacy':['walgreens','walgreen','riteaid','cvs','pharm'],
 'food_delivery':['uber eats','doordash','seamless','grubhub','caviar'],
 'ride_share':['uber','lyft'],
 'whole_foods':['wholefds','whole foods','whole food']
}

I know how to do it one word at a time using df['feature'].str.replace('variation','main word'). However, this is laborious and time-consuming. Is there a faster way to do this? Thank you.


Solution

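  • Reverse your map so that each variation becomes a regex pattern pointing to its main word, then pass the result to replace: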

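    # Turn each variation into a case-insensitive, whole-word regex keyed to its main word
    reverse_map1 = {rf'(?i)\b{v}\b': k for k, variations in map1.items() for v in variations}
    # Replace every match in the column using the pattern -> main word mapping
    df['feature'] = df['feature'].replace(reverse_map1, regex=True)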
    

    Output:

    >>> df
                                                                feature
    0                        0412 ride_share TRIP HELP.ride_share.COMCA
    1                        0410 ride_share TRIP HELP.ride_share.COMCA
    2         MOBILE PURCHASE 0410 VALENCIA whole_foods SAN FRANCISCOCA
    3  whole_foods WBG#1 04/13 PURCHASE whole_foods WBG#104 BROOKLYN NY
    4                  0414 ride_share *CITI BIKE BIK ride_share.COM CA
    5                                 0421 pharmacy.COM 877-250-5823 IL
    6                      0421 Rapha Racing payment LLC XXX-XX72742 OR
    7                  0422 food_delivery PAYMENT HELP.ride_share.COMCA
    8                        0912 whole_foods NOE 10379 SAN FRANCISCOCA
    9           PURCHASE 1003 food_delivery*JUNOON WWW.food_delivery.CA
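
Row 7 of the output only comes out as food_delivery because 'uber eats' appears before 'uber' in the map. The sketch below illustrates why the order matters. It is a plain `re` loop rather than pandas itself, and it assumes the patterns are applied one after another in dictionary order, which is what the output above suggests. The names `specific_first`, `generic_first` and `apply_in_order` are made up for this illustration.

```python
import re

# Hypothetical two-entry excerpts of the reverse map, in two different orders.
specific_first = {r'(?i)\buber eats\b': 'food_delivery', r'(?i)\buber\b': 'ride_share'}
generic_first = {r'(?i)\buber\b': 'ride_share', r'(?i)\buber eats\b': 'food_delivery'}

def apply_in_order(text, mapping):
    """Apply each pattern in dictionary order, one after another (assumed behavior)."""
    for pattern, label in mapping.items():
        text = re.sub(pattern, label, text)
    return text

row = '0422 UBER EATS PAYMENT HELP.UBER.COMCA'
good = apply_in_order(row, specific_first)  # 'uber eats' is consumed before 'uber'
bad = apply_in_order(row, generic_first)    # 'uber' is replaced first, so 'uber eats' never matches
print(good)
print(bad)
```

If the map is built by hand, listing multi-word or more specific variations before shorter ones that they contain is a simple way to avoid the second outcome.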
    

    Details:

    >>> reverse_map1
    {'(?i)\\bpmts\\b': 'payment',
     '(?i)\\bpmnt\\b': 'payment',
     '(?i)\\bpmt\\b': 'payment',
     '(?i)\\bpyment\\b': 'payment',
     '(?i)\\bpymnts\\b': 'payment',
     '(?i)\\bacct\\b': 'account',
     '(?i)\\bwalgreens\\b': 'pharmacy',
     '(?i)\\bwalgreen\\b': 'pharmacy',
     '(?i)\\briteaid\\b': 'pharmacy',
     '(?i)\\bcvs\\b': 'pharmacy',
     '(?i)\\bpharm\\b': 'pharmacy',
     '(?i)\\buber eats\\b': 'food_delivery',
     '(?i)\\bdoordash\\b': 'food_delivery',
     '(?i)\\bseamless\\b': 'food_delivery',
     '(?i)\\bgrubhub\\b': 'food_delivery',
     '(?i)\\bcaviar\\b': 'food_delivery',
     '(?i)\\buber\\b': 'ride_share',
     '(?i)\\blyft\\b': 'ride_share',
     '(?i)\\bwholefds\\b': 'whole_foods',
     '(?i)\\bwhole foods\\b': 'whole_foods',
     '(?i)\\bwhole food\\b': 'whole_foods'}
    
    • (?i): makes the match case-insensitive
    • \b...\b: word boundaries, so only whole words match (for example, pmt does not match inside pmts)
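
To see what these two pieces do, here is a small sketch that runs one of the patterns through Python's `re` module directly. The second input string ('UBERX RIDE') is a made-up example, not part of the question's data.

```python
import re

pattern = r'(?i)\buber\b'

# (?i) lets the lowercase pattern match 'UBER'; the '.' characters count as
# word boundaries, so the occurrence inside 'HELP.UBER.COMCA' matches too.
a = re.sub(pattern, 'ride_share', '0412 UBER TRIP HELP.UBER.COMCA')

# Here 'UBER' is followed by the word character 'X', so the trailing \b fails
# and the text is left unchanged.
b = re.sub(pattern, 'ride_share', 'UBERX RIDE')

print(a)
print(b)
```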

    Update

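    If you don't need to preserve the original upper/lower case, you can lowercase the column first and drop the (?i) flag. Note that this lowercases the whole string, not only the replaced words: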

    reverse_map1 = {rf'\b{v}\b': k for k, l in map1.items() for v in l}
    df['feature'] = df['feature'].str.lower().replace(reverse_map1, regex=True)
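
As a possible variation on the approach above (not part of the original answer), all variations can be combined into one compiled pattern so that each string is scanned once. This is a minimal sketch under a few assumptions: variations are matched as literal text (hence `re.escape`), longer variations are tried first so that 'uber eats' wins over 'uber', and the names `lookup`, `pattern` and `normalize` are made up for the example.

```python
import re

map1 = {'payment': ['pmts', 'pmnt', 'pmt', 'pmts', 'pyment', 'pymnts'],
        'account': ['acct'],
        'pharmacy': ['walgreens', 'walgreen', 'riteaid', 'cvs', 'pharm'],
        'food_delivery': ['uber eats', 'doordash', 'seamless', 'grubhub', 'caviar'],
        'ride_share': ['uber', 'lyft'],
        'whole_foods': ['wholefds', 'whole foods', 'whole food']}

# variation (lowercase) -> main word
lookup = {v.lower(): k for k, variations in map1.items() for v in variations}

# Longest variations first; re.escape treats each variation as literal text.
alternation = '|'.join(re.escape(v) for v in sorted(lookup, key=len, reverse=True))
pattern = re.compile(rf'\b(?:{alternation})\b', flags=re.IGNORECASE)

def normalize(text):
    """Replace every known variation in text with its main word."""
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

print(normalize('0422 UBER EATS PAYMENT HELP.UBER.COMCA'))
print(normalize('PURCHASE 1003 CAVIAR*JUNOON WWW.DOORDASH.CA'))
```

With pandas, the function could then be applied to the column with df['feature'].map(normalize).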