Search code examples
pythonpandasmergeleft-join

Python Pandas: Join on case-insensitive and stripped key


Left Dataframe MASTER =

Fname Lname Amount
John Smith-Richards
David O'Brien

Right Dataframe ORDERS =

Fname Lname Amount
DAVID OBRIEN 36
john smith richards 11

Expected output Dataframe =

Fname Lname Amount
John Smith-Richards 11
David O'Brien 36

The tables MASTER and ORDERS are LEFT-OUTER joined with KEY = [Fname, Lname]

Both tables have the same columns but the rows are not in the same order. I want the join operation to ignore case and ignore characters like dashes, spaces and apostrophes (e.g. the tables would be joined on the values "johnsmithrichards" and "davidobrien")

(In practice, the actual tables have many more columns and the resulting output table must pull some values from the MASTER table and others from the ORDERS table, which is why I used mdfcols and odfcols respectively to only include those specific columns.)

My attempt at removing just spaces doesn't work:

key = ["Fname", "Lname"]

mdfkey = (mdf["First Name"].str.lower() + mdf["Last Name"].str.lower()).replace(' ', '')
mdfcols = ["Fname", "Lname"]

odfkey = (odf["First Name"].str.lower() + odf["Last Name"].str.lower()).replace(' ', '')
odfCols = ["Amount"]

outputdf = pd.merge(
    mdf.loc[:, mdfcols],
    odf.loc[:, odfCols],
    how='left', left_on=mdfkey, right_on=odfkey)

Solution

  • Use Series.str.replace for remove non alpha numeric values for both DataFrames, last remove helper column key_0:

    key = ["Fname", "Lname"]
    
    mdfkey = (mdf["Fname"] + mdf["Lname"]).str.replace('[^a-zA-Z]', '', regex=True).str.lower()
    mdfcols = ["Fname", "Lname"]
    
    
    odfkey = (odf["Fname"] + odf["Lname"]).str.replace('[^a-zA-Z]', '', regex=True).str.lower()
    odfCols = ["Amount"]
    
    outputdf = pd.merge(
        mdf.loc[:, mdfcols],
        odf.loc[:, odfCols],
        how='left', left_on=mdfkey, right_on=odfkey).drop('key_0', axis=1)
    
    print (outputdf)
       Fname           Lname  Amount
    0   John  Smith-Richards      11
    1  David         O'Brien      36
    

    Solution with helper columns:

    merging = ['merge']
    key = ["Fname", "Lname"]
    
    mdf['merge'] = ((mdf["Fname"] + mdf["Lname"]).str.replace('[^a-zA-Z]', '', regex=True)
                                                 .str.lower())
    mdfcols = ["Fname", "Lname"]
    
    
    odf['merge'] = ((odf["Fname"] + odf["Lname"]).str.replace('[^a-zA-Z]', '', regex=True)
                                                 .str.lower())
    odfCols = ["Amount"]
    
    outputdf = pd.merge(
        mdf.loc[:, mdfcols + merging],
        odf.loc[:, odfCols + merging],
        how='left', on=merging).drop(merging, axis=1)
    
    print (outputdf)
       Fname           Lname  Amount
    0   John  Smith-Richards      11
    1  David         O'Brien      36