Tags: python, match, matching

'Vlookup' and return 'Match or Not Match' based on corresponding rows using Python


I have 3 columns, Name, Date and Quantity, in both Table A and Table B. I add a Column D, which is the concatenation of Name, Date and Quantity. I would like to vlookup Column D of Table A against Table B. If there is a match, the output in Column E should be 'Yes'; if there is no match, 'No'.

If the output in Column E is 'No', I would like to identify, based on Column D (the concatenation), whether Name, Date and/or Quantity is the reason for the mismatch. For example, if Name does not match, Column F should be 'Not match name'; otherwise it should be 'Match name'.

The problem I am encountering is that the match/no-match output I get for Name is correct, but the outputs for Date and Quantity are not. I think it is mainly due to one-to-many and many-to-many relationships, since there are repeated values within Name, Date and Quantity.

My code is not consistent because I amended it from time to time due to the incorrect output, especially for Date and Quantity. This is what I have tried so far:

import pandas as pd

# Concatenate the 3 columns
df2_A = df1_A.copy()
df2_A.loc[:, 'A_Concate'] = df2_A['Name'].astype(str) + df2_A['Date'].astype(str) + df2_A['Quantity'].astype(str)

df2_B = df1_B.copy()
df2_B.loc[:, 'B_Concate'] = df2_B['Name'].astype(str) + df2_B['Date'].astype(str) + df2_B['Quantity'].astype(int).astype(str)

# Vlookup the concatenated column of Table A against Table B
df2_A['Match with B?'] = df2_A['A_Concate'].isin(df2_B['B_Concate']).map({True: 'Yes', False: 'No'})

# Find the reason for a non-match
df2_A['Match name?'] = df2_A.apply(lambda row: 'Not match name' if row['Match with B?'] == 'No' and row['Name'] not in df2_B['Name'].unique() else 'Match name', axis=1)

df2_A['Match date?'] = df2_A.apply(lambda row: 'Match date' if row['Match with B?'] == 'Yes' else ('Not match date' if row['Date'] not in df2_B.loc[df2_B['B_Concate'] == row['A_Concate'], 'Date'].values else 'Match date'), axis=1)

df2_A['Match quantity?'] = df2_A.apply(lambda row: 'Not match quantity' if row['Match with B?'] == 'No' and row['Match name?'] == 'Not match name' else ('Not match quantity' if row['Match with B?'] == 'No' and row['Quantity'] not in df2_B['Quantity'].unique() else 'Match quantity'), axis=1)

Which part can be improved so that the output can be returned based on the concatenated rows?


Solution

  • IIUC, you can use merge this way:

    import numpy as np
    import pandas as pd

    # row-level check: left-merge on all three columns; the merge indicator says
    # whether each row of df2_A also exists in df2_B
    out = (pd.merge(df2_A, df2_B, on=list(df2_A.columns), how="left", indicator="Match with B?")
               .replace({"Match with B?": {"both": "Yes", "left_only": "No"}}))

    # column-level check: compare each column's (de-duplicated) values against df2_B
    # and collect the names of the columns whose value was not found
    out["Why ?"] = (pd.concat([pd.merge(df2_A[[col]].drop_duplicates(), df2_B[[col]].drop_duplicates(),
                                        on=col, how="left", indicator=f"check_{i}")
                               for i, col in enumerate(df2_A.columns)], axis=1).filter(like="check")
                       .set_axis(df2_A.columns, axis=1).replace({"both": True, "left_only": False})
                       .apply(lambda x: np.where(x.eq(False), x.name, None)).stack().groupby(level=0).agg(list))
    

    Output:

    print(out)
      Name        Date  Quantity Match with B?                   Why ?
    0  foo  2023-02-11         1            No        [Date, Quantity]
    1  bar  2023-03-22         2           Yes                     NaN
    2  baz  2023-01-05         3            No  [Name, Date, Quantity]
    3  qux  2023-04-18         4            No            [Name, Date]
    4  bar  2023-05-01         5            No                  [Date]
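
    A quick aside on the key mechanism, in case the indicator argument is new: passing indicator=<column name> to pd.merge adds a categorical column that records whether each row was found in the left frame only, the right frame only, or both. A minimal standalone illustration (the toy frames and the "found" name are made up for this example, not taken from the question):

    import pandas as pd

    left = pd.DataFrame({"key": ["a", "b"]})
    right = pd.DataFrame({"key": ["b", "c"]})

    # "left_only" -> the key exists only in `left`, "both" -> it exists in both frames
    print(pd.merge(left, right, on="key", how="left", indicator="found"))
    #   key      found
    # 0   a  left_only
    # 1   b       both

    This is exactly what the first merge above maps to "Yes"/"No" for the "Match with B?" column.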
    

    If you want to expose the match check for each column, use this:

    # same row-level merge as before
    tmp = (pd.merge(df2_A, df2_B, on=list(df2_A.columns), how="left", indicator="Match with B?")
               .replace({"Match with B?": {"both": "Yes", "left_only": "No"}}))

    # join one "Match <column> ?" Yes/No indicator per column onto the merge result
    out = tmp.join(pd.concat([pd.merge(df2_A[[col]].drop_duplicates(), df2_B[[col]].drop_duplicates(),
                                        on=col, how="left", indicator=f"check_{i}")
                               for i, col in enumerate(df2_A.columns)], axis=1).filter(like="check")
                       .set_axis(df2_A.columns, axis=1).replace({"both": True, "left_only": False})
                       .add_prefix("Match ").add_suffix(" ?").replace({True: "Yes", False: "No"}).fillna("Yes"))
    

    Output:

    print(out)
    
      Name        Date  Quantity Match with B? Match Name ? Match Date ? Match Quantity ?
    0  foo  2023-02-11         1            No          Yes           No               No
    1  bar  2023-03-22         2           Yes          Yes          Yes              Yes
    2  baz  2023-01-05         3            No           No           No               No
    3  qux  2023-04-18         4            No           No           No              Yes
    4  bar  2023-05-01         5            No          Yes           No              Yes
    

    [Screenshot omitted: the output table above with the match results highlighted.]
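
    If you prefer the exact wording from the question ("Match name" / "Not match name", and so on), you can map the Yes/No indicator columns afterwards; a small follow-up sketch on the out frame built above (the target column names are only a suggestion):

    out["Match name?"] = out["Match Name ?"].map({"Yes": "Match name", "No": "Not match name"})
    out["Match date?"] = out["Match Date ?"].map({"Yes": "Match date", "No": "Not match date"})
    out["Match quantity?"] = out["Match Quantity ?"].map({"Yes": "Match quantity", "No": "Not match quantity"})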

    Inputs used:

    df2_A = pd.DataFrame({
        "Name": ["foo", "bar", "baz", "qux", "bar"],
        "Date": ["2023-02-11", "2023-03-22", "2023-01-05", "2023-04-18", "2023-05-01"],
        "Quantity": [1, 2, 3, 4, 5]
    })
    
    df2_B = pd.DataFrame({
        "Name": ["foo", "xyz", "foo", "bar"],
        "Date": ["2023-02-30", "2023-02-25", "2023-03-10", "2023-03-22"],
        "Quantity": [5, 4, 6, 2]
    })
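
    One practical note, since the original attempt converted Quantity with .astype(int).astype(str): the merge-based approach compares the key columns directly, so both frames should hold them with the same dtype, otherwise pandas may find no matches or refuse to merge. A small normalisation sketch, assuming Quantity might arrive as strings in one of the real tables (the example inputs above already use integers):

    # assumption: Quantity could be read in as strings in one of the source tables;
    # make the dtype identical on both sides before merging on it
    df2_A["Quantity"] = pd.to_numeric(df2_A["Quantity"]).astype(int)
    df2_B["Quantity"] = pd.to_numeric(df2_B["Quantity"]).astype(int)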