I have two dataframes that look like below
proj_df = pd.DataFrame({'reg_id':[1,2,3,4],
'part_no':['P1','P2','P3','P4'],
'partner':['A','B','C','D'],
'cust_name_1': ['ABC PVT LTD','Tesla','Apple','Google'],
'cust_name_2':['ABC','Tesla Ltd','Apple Inc','Google Enterprises'],
'cust_name_3':['ABC','Tesla America','Apple America','Google Ent Pvt ltd']})
data_df = pd.DataFrame({'cust_name': ['ABC','Tesla America','Apple Inc','Google','Google','ABC'],
'partner':['A','B','C','D','E','A'],
'part_no':['P1','P2','P3','P4','P5','P6'],
'qty':[100,100,600,150,320,410]})
I would like to do the below
a) Identify the exactly matching customer name column from proj_df
by comparing it with data_df
.
b) Ex: select one exactly matching column from cust_name_1
, cust_name_2
, cust_name_3
by comparing it with cust_name
column from data_df. If two or more columns has 100% match, then choose any one of the columns.
b) Compare/merge both dataframes based on matching part_no
, partner
columns.
I tried the below but it is going nowhere
unique_names = data_df['cust_name'].tolist()
for name in unique_names:
proj_df['similarity_ratio'] = proj_df.apply(lambda x: difflib.SequenceMatcher(None, name, x.name_1).ratio(), axis=1)
I expect my output to be like as below
If need 100% match use DataFrame.melt
with left join, then compare both columns for exact match and for one row per part_no/partner
add DataFrame.drop_duplicates
:
df = (proj_df.melt(['reg_id','part_no','partner'],
var_name='matching_column_name',
value_name='matching_column_value')
.merge(data_df, on=['part_no','partner']))
df = (df[df['matching_column_value'].eq(df['cust_name'])]
.drop_duplicates(['part_no','partner']))
Last for matching by original proj_df
use left join only by necesary columns:
need = ['part_no','partner', 'matching_column_name','matching_column_value','qty']
df = proj_df.merge(df[need], how='left')
print (df)
reg_id part_no partner cust_name_1 cust_name_2 \
0 1 P1 A ABC PVT LTD ABC
1 2 P2 B Tesla Tesla Ltd
2 3 P3 C Apple Apple Inc
3 4 P4 D Google Google Enterprises
cust_name_3 matching_column_name matching_column_value qty
0 ABC cust_name_2 ABC 100
1 Tesla America cust_name_3 Tesla America 100
2 Apple America cust_name_2 Apple Inc 600
3 Google Ent Pvt ltd cust_name_1 Google 150
If need maximal similar match solution use SequenceMatcher
for ratio
and for first maximal ratio
rows use DataFrameGroupBy.idxmax
:
df = (proj_df.melt(['reg_id','part_no','partner'],
var_name='matching_column_name',
value_name='matching_column_value')
.merge(data_df, on=['part_no','partner'])
df['similarity_ratio'] = [difflib.SequenceMatcher(None, a, b).ratio()
for a, b in zip(df['matching_column_value'], df['cust_name'])]
df = df.loc[df.groupby(['part_no','partner'])['similarity_ratio'].idxmax()]
print (df)
need = ['part_no','partner', 'matching_column_name','matching_column_value','qty']
df = proj_df.merge(df[need], how='left')
print (df)
reg_id part_no partner cust_name_1 cust_name_2 \
0 1 P1 A ABC PVT LTD ABC
1 2 P2 B Tesla Tesla Ltd
2 3 P3 C Apple Apple Inc
3 4 P4 D Google Google Enterprises
cust_name_3 matching_column_name matching_column_value qty
0 ABC cust_name_2 ABC 100
1 Tesla America cust_name_3 Tesla America 100
2 Apple America cust_name_2 Apple Inc 600
3 Google Ent Pvt ltd cust_name_1 Google 150