Search code examples
pythonpandasstringdataframestring-comparison

Searching substring of a dataframe if it exists in another dataframe column


I need some help with searching a string or a substring in the chemicals column in dataframe1 and checking to see if it exists in dataframe2 then creating a new column in dataframe1 to return the corresponding chemical name column from dataframe2. Can anyone assist?

Thanks


Solution

  • I'm not 100% sure your question is clear, however I've made an attempt at what you're saying you're trying to do with an example. Here we search each element in df1, and then return a list of every match in df2 in a new column in df1. Let me know if this is what you're expecting:

    df1 = pd.DataFrame({'CHEMICALS': ['AAA', 'BBB', 'ccc'],
                       'label': [0.0, 1.0, 0.0]
                       })
    df2 = pd.DataFrame({'CHEMICALS': ['DDD', 'BBB_2', 'ccc_2', 'ccc_3'],
                       })
    
    for ind1 in df1.index:
        df1.loc[ind1, 'df2_match'] = ', '.join(list(df2[df2['CHEMICALS'].str.contains(df1['CHEMICALS'][ind1])]['CHEMICALS']))
    

    To break this up a little bit:

    x1 = df2['CHEMICALS'].str.contains(df1['CHEMICALS'][ind1])
    

    this returns a TRUE/FALSE series for if an item in df2 contains the string at position ind1 in df1.

    x2 = df2[x1]['CHEMICALS']
    

    This returns the name of each CHEMICAL in df2 at the positions specified by the TRUE/FALSE series.

    x3 = ', '.join(list(x2))
    

    This then turns those names as a list and then joins them together with a ', ' inbetween. This is then stuck in the new column of df1 at the correct index, and repeated for each chemical in df1.

    The output looks like this:

    df1
        CHEMICALS   label   df2_match
    0   AAA         0.0 
    1   BBB         1.0     BBB_2
    2   ccc         0.0     ccc_2, ccc_3