Search code examples
pythonpandasdata-analysis

Change string values with new values contain in another data frame


I have a csv with thousand of rows with sales data as follows:

pd.DataFrame({
    'Item_name': ['guacamole', 'morita', 'verde', 'pico', 'tomatillo'],
    'Inv_number': ['0001', '0002', '0003', '0004', '0005'],
    'Store_name': ['alex', 'pusateris', 'wholefoods','longos', 'metro']

Now the item names have changed to the following:

pd.DataFrame ({
'Item_name': ['Dip guacamole', 'morita Spicy', ' Salsa verde', 'Pico de Gallo', 'Roasted tomatillo']

What I am looking to achieve is to change the old name to the new. I am using the following code for each item, but this is going to take forever!

sales_df['item_code']= sales_df['item_code'].replace({'Guacamole':'Dip Guacamole'}) 

Is there a way to simplify this code? Maybe create a list with the new names and iterate through the sales data?

Looking forward to hearing your comments.

Thank you!


Solution

  • Use fuzzy logic here.

    # Python env: pip install thefuzz
    # Anaconda env: conda install thefuzz
    
    from thefuzz import process
    
    THRESHOLD = 90  # reject all values below this score (%)
    
    # df: your original dataframe
    # df1: your new names
    df['Item_name_new'] = \
        df['Item_name'].apply(lambda x: process.extractOne(x, df1['Item_name'],
                                  score_cutoff=THRESHOLD)).str[0]
    print(df)
    
    # Output
       Item_name Inv_number  Store_name      Item_name_new
    0  guacamole       0001        alex      Dip guacamole
    1     morita       0002   pusateris       morita Spicy
    2      verde       0003  wholefoods        Salsa verde
    3       pico       0004      longos      Pico de Gallo
    4  tomatillo       0005       metro  Roasted tomatillo
    5      water       0006      nature               None