I have a csv with thousand of rows with sales data as follows:
pd.DataFrame({
'Item_name': ['guacamole', 'morita', 'verde', 'pico', 'tomatillo'],
'Inv_number': ['0001', '0002', '0003', '0004', '0005'],
'Store_name': ['alex', 'pusateris', 'wholefoods','longos', 'metro']
Now the item names have changed to the following:
pd.DataFrame ({
'Item_name': ['Dip guacamole', 'morita Spicy', ' Salsa verde', 'Pico de Gallo', 'Roasted tomatillo']
What I am looking to achieve is to change the old name to the new. I am using the following code for each item, but this is going to take forever!
sales_df['item_code']= sales_df['item_code'].replace({'Guacamole':'Dip Guacamole'})
Is there a way to simplify this code? Maybe create a list with the new names and iterate through the sales data?
Looking forward to hearing your comments.
Thank you!
Use fuzzy logic here.
# Python env: pip install thefuzz
# Anaconda env: conda install thefuzz
from thefuzz import process
THRESHOLD = 90 # reject all values below this score (%)
# df: your original dataframe
# df1: your new names
df['Item_name_new'] = \
df['Item_name'].apply(lambda x: process.extractOne(x, df1['Item_name'],
score_cutoff=THRESHOLD)).str[0]
print(df)
# Output
Item_name Inv_number Store_name Item_name_new
0 guacamole 0001 alex Dip guacamole
1 morita 0002 pusateris morita Spicy
2 verde 0003 wholefoods Salsa verde
3 pico 0004 longos Pico de Gallo
4 tomatillo 0005 metro Roasted tomatillo
5 water 0006 nature None