Search code examples
pandasmatchisin

Check if substring is in a string in a different DF, if it is then return value from another row


I want to check is a substring from DF1 is in DF2. If it is I want to return a value of a corresponding row.

DF1

Name ID Region
John AAA A
John AAA B
Pat CCC C
Sandra CCC D
Paul DD E
Sandra R9D F
Mia dfg4 G
Kim asfdh5 H
Louise 45gh I

DF2

Name ID Company
John AAAxx1 Microsoft
John AAAxxREG1 Microsoft
Michael BBBER4 Microsoft
Pat CCCERG Dell
Pat CCCERGG Dell
Paul DFHDHF Facebook

Desired Output

Where ID from DF1 is in the ID column of DF2 I want to create a new column in DF1 that matches the company

Name ID Region Company
John AAA A Microsoft
John AAA B Microsoft
Pat CCC C Dell
Sandra CCC D
Paul DD E
Sandra R9D F
Mia dfg4 G
Kim asfdh5 H
Louise 45gh I

I have the below code that determines if the ID from DF1 is in DF2 however I'm not sure how I can bring in the company name.

DF1['Get company'] = np.in1d(DF1['ID'], DF2['ID'])


Solution

  • Try to find ID string from df1 into df2 then merge on this column:

    key = df2['ID'].str.extract(fr"({'|'.join(df1['ID'].values)})", expand=False)
    df1 = df1.merge(df2['Company'], left_on='ID', right_on=key, how='left').fillna('')
    print(df1)
    
    # Output:
        Name    ID    Company
    0   John   AAA           
    1  Peter   BAB  Microsoft
    2   Paul  CCHF     Google
    3  Rosie   R9D           
    

    Details: create a regex from df1['ID'] to extract partial string from df2['ID']:

    # Regex pattern: try to extract the following pattern
    >>> fr"({'|'.join(df1['ID'].values)})"
    '(AAA|BAB|CCHF|R9D)'
    
    # After extraction
    >>> pd.concat([df2['ID'], key], axis=1)
            ID    ID
    0    AEDSV   NaN  # Nothing was found
    1   123BAB   BAB  # Found partial string BAB
    2  CCHF-RB  CCHF  # Found partial string CCHF
    3     YYYY   NaN  # Nothing was found
    

    Update:

    To solve this I wonder is it possible to merge based on 2 columns. e.g merge on Name and ID?

    key = df2['ID'].str.extract(fr"({'|'.join(df1['ID'].values)})", expand=False)
    df1 = pd.merge(df1, df2[['Name', 'Company']], left_on=['Name', 'ID'], 
                   right_on=['Name', key], how='left').drop_duplicates().fillna('')
    print(df1)
    
    # Output:
          Name      ID Region    Company
    0     John     AAA      A  Microsoft
    2     John     AAA      B  Microsoft
    4      Pat     CCC      C       Dell
    6   Sandra     CCC      D           
    7     Paul      DD      E           
    8   Sandra     R9D      F           
    9      Mia    dfg4      G           
    10     Kim  asfdh5      H           
    11  Louise    45gh      I