Search code examples
pythonpandastext

Pandas text pattern matching 80% to 90% and delete matched


I have a table with same product with different colors. The quantity is different for each. However, the format is so different for the product.

PROD                        QUAN
product abcdefg (white)     20
product abcdefg blue color  30
product abcdefg - pink      20
product zy                  80 

I want to -

  • Match atleast 80%-90% of the text in PROD and
  • Delete the ones with Higher Quantity
  • Keep only One of the three products and delete the rest

The final output expected is

PROD                        QUAN
product abcdefg (white)     20
product zy                  80 

Solution

  • Match atleast 80%-90% of the text in PROD and

    How about 70%?

    Try:

    import pandas as pd
    
    df = pd.DataFrame({'PROD': {0: 'product abcdefg (white)',
      1: 'product abcdefg blue color',
      2: 'product abcdefg - pink',
      3: 'product zy'},
     'QUAN': {0: 20, 1: 30, 2: 20, 3: 80}})
    
    
    # slice at (say 70%)...
    cutOff = round((len(df.loc[0,'PROD'])/100) * 70)
    
    df['PROD2'] = df['PROD'].str.slice(stop=cutOff)
    
    df = df[df['QUAN'] == df.groupby('PROD2')['QUAN'].transform('min')]
    
    df = df.drop_duplicates(subset=['PROD2','QUAN'], keep="first").reset_index(drop=True)
    
    del df['PROD2']
    
    print(df)
    
        PROD                        QUAN
    0   product abcdefg (white)     20
    1   product zy                  80