I have a pandas DataFrame with a string column of transaction descriptions. I am trying to do some manual lemmatization. I have manually created a dictionary that has the main word as the key and a list of variations of that word as the value. I would like to substitute each variation in the list with its main word.
Here is example code for the data I have.
import pandas as pd
list1 = ['0412 UBER TRIP HELP.UBER.COMCA',
'0410 UBER TRIP HELP.UBER.COMCA',
'MOBILE PURCHASE 0410 VALENCIA WHOLE FOODS SAN FRANCISCOCA',
'WHOLEFDS WBG#1 04/13 PURCHASE WHOLEFDS WBG#104 BROOKLYN NY',
'0414 LYFT *CITI BIKE BIK LYFT.COM CA',
'0421 WALGREENS.COM 877-250-5823 IL',
'0421 Rapha Racing PMT LLC XXX-XX72742 OR',
'0422 UBER EATS PAYMENT HELP.UBER.COMCA',
'0912 WHOLEFDS NOE 10379 SAN FRANCISCOCA',
'PURCHASE 1003 CAVIAR*JUNOON WWW.DOORDASH.CA']
df = pd.DataFrame(list1, columns = ['feature'])
map1 = {'payment':['pmts','pmnt','pmt','pmts','pyment','pymnts'],
'account':['acct'],
'pharmacy':['walgreens','walgreen','riteaid','cvs','pharm'],
'food_delivery':['uber eats','doordash','seamless','grubhub','caviar'],
'ride_share':['uber','lyft'],
'whole_foods':['wholefds','whole foods','whole food']
}
I know how to do it one word at a time using df['feature'].str.replace('variation', 'main word'). However, this is laborious and time-consuming. Is there a faster way to do this? Thank you.
Reverse your map so that each variation becomes a regex pattern that maps to its main word:
reverse_map1 = {rf'(?i)\b{v}\b': k for k, l in map1.items() for v in l}
df['feature'] = df['feature'].replace(reverse_map1, regex=True)
Output:
>>> df
feature
0 0412 ride_share TRIP HELP.ride_share.COMCA
1 0410 ride_share TRIP HELP.ride_share.COMCA
2 MOBILE PURCHASE 0410 VALENCIA whole_foods SAN FRANCISCOCA
3 whole_foods WBG#1 04/13 PURCHASE whole_foods WBG#104 BROOKLYN NY
4 0414 ride_share *CITI BIKE BIK ride_share.COM CA
5 0421 pharmacy.COM 877-250-5823 IL
6 0421 Rapha Racing payment LLC XXX-XX72742 OR
7 0422 food_delivery PAYMENT HELP.ride_share.COMCA
8 0912 whole_foods NOE 10379 SAN FRANCISCOCA
9 PURCHASE 1003 food_delivery*JUNOON WWW.food_delivery.CA
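The result above relies on 'uber eats' being tried before 'uber', which holds here only because food_delivery precedes ride_share in map1. It also relies on none of the variations containing regex metacharacters. As a minimal sketch, assuming pandas applies the patterns in dictionary order (as the output above suggests), the variations can be escaped with re.escape and sorted longest first. The names pairs and safe_map are illustrative, and the map is trimmed to a few entries:

```python
import re
import pandas as pd

# Trimmed copy of the question's map, for illustration only.
map1 = {
    'food_delivery': ['uber eats', 'doordash'],
    'ride_share': ['uber', 'lyft'],
    'pharmacy': ['walgreens', 'walgreen'],
}

# Flatten to (variation, main word) pairs, longest variation first,
# so 'uber eats' is tried before 'uber' whatever the order in map1.
pairs = sorted(
    ((v, k) for k, variations in map1.items() for v in variations),
    key=lambda pair: len(pair[0]),
    reverse=True,
)

# re.escape protects variations that contain regex metacharacters.
safe_map = {rf'(?i)\b{re.escape(v)}\b': k for v, k in pairs}

df = pd.DataFrame({'feature': ['0422 UBER EATS PAYMENT HELP.UBER.COMCA',
                               '0414 LYFT *CITI BIKE BIK LYFT.COM CA']})
df['feature'] = df['feature'].replace(safe_map, regex=True)
print(df['feature'].tolist())
```

With the sort in place, reordering the keys of map1 should not change the result.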
Details:
>>> reverse_map1
{'(?i)\\bpmts\\b': 'payment',
'(?i)\\bpmnt\\b': 'payment',
'(?i)\\bpmt\\b': 'payment',
'(?i)\\bpyment\\b': 'payment',
'(?i)\\bpymnts\\b': 'payment',
'(?i)\\bacct\\b': 'account',
'(?i)\\bwalgreens\\b': 'pharmacy',
'(?i)\\bwalgreen\\b': 'pharmacy',
'(?i)\\briteaid\\b': 'pharmacy',
'(?i)\\bcvs\\b': 'pharmacy',
'(?i)\\bpharm\\b': 'pharmacy',
'(?i)\\buber eats\\b': 'food_delivery',
'(?i)\\bdoordash\\b': 'food_delivery',
'(?i)\\bseamless\\b': 'food_delivery',
'(?i)\\bgrubhub\\b': 'food_delivery',
'(?i)\\bcaviar\\b': 'food_delivery',
'(?i)\\buber\\b': 'ride_share',
'(?i)\\blyft\\b': 'ride_share',
'(?i)\\bwholefds\\b': 'whole_foods',
'(?i)\\bwhole foods\\b': 'whole_foods',
'(?i)\\bwhole food\\b': 'whole_foods'}
(?i): case insensitive
\b...\b: word boundary
Update
If you don't need to preserve the original case, you can lowercase the column first and drop the (?i) flag:
reverse_map1 = {rf'\b{v}\b': k for k, l in map1.items() for v in l}
df['feature'] = df['feature'].str.lower().replace(reverse_map1, regex=True)
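Since the question asks for something faster, one possible alternative (not the approach used above) is to join all variations into a single compiled pattern and resolve each match with a callback, so every string is scanned once instead of once per variation. This is only a sketch: lookup, pattern and to_main_word are hypothetical names, the map is trimmed, and whether it is actually faster on your data would need measuring.

```python
import re
import pandas as pd

# Trimmed copy of the question's map, for illustration only.
map1 = {
    'food_delivery': ['uber eats', 'doordash', 'caviar'],
    'ride_share': ['uber', 'lyft'],
}

# Lowercased variation -> main word, used inside the callback.
lookup = {v.lower(): k for k, variations in map1.items() for v in variations}

# One alternation, longest variation first so 'uber eats' wins over 'uber'.
alternatives = sorted(lookup, key=len, reverse=True)
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(v) for v in alternatives) + r')\b',
    flags=re.IGNORECASE,
)

def to_main_word(match):
    # Map the matched text back to its main word, ignoring case.
    return lookup[match.group(0).lower()]

s = pd.Series(['0422 UBER EATS PAYMENT HELP.UBER.COMCA',
               'PURCHASE 1003 CAVIAR*JUNOON WWW.DOORDASH.CA'])
result = s.str.replace(pattern, to_main_word, regex=True)
print(result.tolist())
```

Because the alternation is matched in a single pass, replaced text is never rescanned, so the outcome does not depend on the order of the keys in map1.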