I have 3 columns, Name, Date and Quantity for Table A, and Table B. And I add Column D, which is the result of the concatenated of Column Name, Date and Quantity. I would like to vlookup the Column D in Table A with Table B. If there is a match, the output would be 'Yes' at Column E; if no match, then 'No'.
If the output is 'No' in Column E, given Column D (concatenated), I would like to identify if Column Name, Date or/and Quantity is/are the reason of not match. For example, if Name is not match, return output in Column F as 'Not match name', otherwise return output as 'Match name'.
The problem I encountered right now is the output (match or no match) that I found for Name is correct, but not for Date and Quantity. I think it is mainly due to one-to-many, many-to-many relationship, in which there are multiple repetitive within Name, Date and Quantity.
My code is not consistent because I amended them from time to time due to the incorrect output especially for Date and Quantity.
This is what I tried so far:
#Concate the 3 columns
df2_A = df1_A.copy()
df2_A.loc[:, 'A_Concate'] = df2_A['Name'].astype(str) + df2_of01['Date'].astype(str) + df2_A['Quantity'].astype(str)
df2_B = df1_B.copy()
df2_B.loc[:, 'B_Concate'] = df2_B['Name'].astype(str) + df2_Name['Date'].astype(str) + df2_B['Quantity'].astype(int).astype(str)
#Vlookup concatenated column for Table A and B
df2_A ['Match with B?'] = df2_A ['A_Concate'].isin(df2_B['B_Concate']).map({True: 'Yes', False: 'No'})
#Find reason of not match
df2_A ['Match name?'] = df2_A .apply(lambda row: 'Not match name' if row['Match with B?'] == 'No' and row['name'] not in df2_B['Name'].unique() else 'Match name', axis=1)
df2_A ['Match date?'] = df2_A .apply(lambda row: 'Match date' if row['Match with B?'] == 'Yes' else ('Not match date' if row['Date'] not in df2_B.loc[df2_B['B_Concate'] == row['A_Concate'], 'Date'].values else 'Match date'), axis=1)
df2_A ['Match quantity?'] = df2_A .apply(lambda row: 'Not match quantity' if row['Match with B?'] == 'No' and row['Match part?'] == 'Not match part' else ('Not match quantity' if row['Match with B?'] == 'No' and row['SUGGESTED QTY'] not in df2_B['Quantity'].unique() else 'Match quantity'), axis=1)
Which part can be improved so that the output can be returned based on the concatenated rows?
IIUC, you can use merge
this way :
out = (pd.merge(df2_A, df2_B, on=list(df2_A.columns), how="left", indicator="Match with B?")
.replace({"Match with B?": {"both": "Yes", "left_only": "No"}}))
out["Why ?"] = (pd.concat([pd.merge(df2_A[[col]].drop_duplicates(), df2_B[[col]].drop_duplicates(),
on=col, how="left", indicator=f"check_{i}")
for i, col in enumerate(df2_A.columns)], axis=1).filter(like="check")
.set_axis(df2_A.columns, axis=1).replace({"both": True, "left_only": False})
.apply(lambda x: np.where(x.eq(False), x.name, None)).stack().groupby(level=0).agg(list)
)
Output :
print(out)
Name Date Quantity Match with B? Why ?
0 foo 2023-02-11 1 No [Date, Quantity]
1 bar 2023-03-22 2 Yes NaN
2 baz 2023-01-05 3 No [Name, Date, Quantity]
3 qux 2023-04-18 4 No [Name, Date]
4 bar 2023-05-01 5 No [Date]
If you want to expose the match check of each column, use this :
tmp = (pd.merge(df2_A, df2_B, on=list(df2_A.columns), how="left", indicator="Match with B?")
.replace({"Match with B?": {"both": "Yes", "left_only": "No"}}))
out = tmp.join(pd.concat([pd.merge(df2_A[[col]].drop_duplicates(), df2_B[[col]].drop_duplicates(),
on=col, how="left", indicator=f"check_{i}")
for i, col in enumerate(df2_A.columns)], axis=1).filter(like="check")
.set_axis(df2_A.columns, axis=1).replace({"both": True, "left_only": False})
.add_prefix("Match ").add_suffix(" ?").replace({True: "Yes", False: "No"}).fillna("Yes")
)
Output :
print(out)
Name Date Quantity Match with B? Match Name ? Match Date ? Match Quantity ?
0 foo 2023-02-11 1 No Yes No No
1 bar 2023-03-22 2 Yes Yes Yes Yes
2 baz 2023-01-05 3 No No No No
3 qux 2023-04-18 4 No No No Yes
4 bar 2023-05-01 5 No Yes No Yes
With highlights to see the results :
Inputs used :
df2_A = pd.DataFrame({
"Name": ["foo", "bar", "baz", "qux", "bar"],
"Date": ["2023-02-11", "2023-03-22", "2023-01-05", "2023-04-18", "2023-05-01"],
"Quantity": [1, 2, 3, 4, 5]
})
df2_B = pd.DataFrame({
"Name": ["foo", "xyz", "foo", "bar"],
"Date": ["2023-02-30", "2023-02-25", "2023-03-10", "2023-03-22"],
"Quantity": [5, 4, 6, 2]
})