Search code examples
pythonpandasdataframegroup-bymerge

Pandas create a new column based on exact match of text values


I have two dataframes that look like below

proj_df = pd.DataFrame({'reg_id':[1,2,3,4],
                        'part_no':['P1','P2','P3','P4'],
                        'partner':['A','B','C','D'],
                        'cust_name_1': ['ABC PVT LTD','Tesla','Apple','Google'],
                        'cust_name_2':['ABC','Tesla Ltd','Apple Inc','Google Enterprises'],
                        'cust_name_3':['ABC','Tesla America','Apple America','Google Ent Pvt ltd']})


data_df = pd.DataFrame({'cust_name': ['ABC','Tesla America','Apple Inc','Google','Google','ABC'],
                        'partner':['A','B','C','D','E','A'],
                        'part_no':['P1','P2','P3','P4','P5','P6'],
                        'qty':[100,100,600,150,320,410]})

I would like to do the below

a) Identify the exactly matching customer name column from proj_df by comparing it with data_df.

b) Ex: select one exactly matching column from cust_name_1, cust_name_2, cust_name_3 by comparing it with cust_name column from data_df. If two or more columns has 100% match, then choose any one of the columns.

b) Compare/merge both dataframes based on matching part_no, partner columns.

I tried the below but it is going nowhere

unique_names = data_df['cust_name'].tolist()
for name in unique_names:
    proj_df['similarity_ratio'] = proj_df.apply(lambda x: difflib.SequenceMatcher(None, name, x.name_1).ratio(), axis=1)

I expect my output to be like as below

enter image description here


Solution

  • If need 100% match use DataFrame.melt with left join, then compare both columns for exact match and for one row per part_no/partner add DataFrame.drop_duplicates:

    df = (proj_df.melt(['reg_id','part_no','partner'],
                      var_name='matching_column_name', 
                      value_name='matching_column_value')
                 .merge(data_df, on=['part_no','partner']))
    
    df = (df[df['matching_column_value'].eq(df['cust_name'])]
               .drop_duplicates(['part_no','partner']))
    

    Last for matching by original proj_df use left join only by necesary columns:

    need = ['part_no','partner', 'matching_column_name','matching_column_value','qty']
    
    df = proj_df.merge(df[need], how='left')
    print (df)
       reg_id part_no partner  cust_name_1         cust_name_2  \
    0       1      P1       A  ABC PVT LTD                 ABC   
    1       2      P2       B        Tesla           Tesla Ltd   
    2       3      P3       C        Apple           Apple Inc   
    3       4      P4       D       Google  Google Enterprises   
    
              cust_name_3 matching_column_name matching_column_value  qty  
    0                 ABC          cust_name_2                   ABC  100  
    1       Tesla America          cust_name_3         Tesla America  100  
    2       Apple America          cust_name_2             Apple Inc  600  
    3  Google Ent Pvt ltd          cust_name_1                Google  150  
    

    If need maximal similar match solution use SequenceMatcher for ratio and for first maximal ratio rows use DataFrameGroupBy.idxmax:

    df = (proj_df.melt(['reg_id','part_no','partner'], 
                       var_name='matching_column_name', 
                       value_name='matching_column_value')
                  .merge(data_df, on=['part_no','partner'])
    df['similarity_ratio'] = [difflib.SequenceMatcher(None, a, b).ratio() 
                               for a, b in zip(df['matching_column_value'], df['cust_name'])]
    
    df = df.loc[df.groupby(['part_no','partner'])['similarity_ratio'].idxmax()]
    print (df)
    
    need = ['part_no','partner', 'matching_column_name','matching_column_value','qty']
    
    df = proj_df.merge(df[need], how='left')
    print (df)
       reg_id part_no partner  cust_name_1         cust_name_2  \
    0       1      P1       A  ABC PVT LTD                 ABC   
    1       2      P2       B        Tesla           Tesla Ltd   
    2       3      P3       C        Apple           Apple Inc   
    3       4      P4       D       Google  Google Enterprises   
    
              cust_name_3 matching_column_name matching_column_value  qty  
    0                 ABC          cust_name_2                   ABC  100  
    1       Tesla America          cust_name_3         Tesla America  100  
    2       Apple America          cust_name_2             Apple Inc  600  
    3  Google Ent Pvt ltd          cust_name_1                Google  150