Search code examples
pythonpandasnlpstring-matching

print strings of one dataframe contained in another dataframe


I have two dataframes: one dataframe consists of two columns ('good' and bad') and another one that contains text data.

df_dictionary = pd.DataFrame({'good': ['love', 'like'],
                    'bad': ['dislike', 'hate']})
df_text = pd.DataFrame({'col1': ['i love cats', 'i hate dogs']})

Now I would like to retrieve exact string matches of words that are in the dictionary and are contained in col1 of df_text and assign the string match to the second column of df_text.

I tried .isin(), however this code only shows exact string matches if the whole phrase matches and not if the word is contained in the sentence.

df_text should then look as follows:

col1 string_match_good string_match_bad
i love cats love
i hate dogs hate

I do not want partial string matches, e.g. if col1 says 'i loved cats', then I do not want a string match.

I found the following: matches = df_text[df_text['col1'].str.contains(fr"\b(?:{'|'.join(df_dictionary)})\b")] , however this one does not print the matched words (i.e. good or bad) in the string_match columns.

Does anyone have a solution to it?


Solution

  • I think the data structure is not ideal, specifically because your text values are conceptually several values in one (i.e., lists of tokens/words) but pandas works best with one value per cell. Here's how I'd approach it:

    1. Explode the strings such that you get one word per cell.
    df_text = (
               df_text.col1.str.split() # split into single words
               .explode() # explode them to one word per cell
               .rename_axis("sent_index") # rename the index for later
               .reset_index() # set the sent_index as its own column
               )
    

    Intermediary result:

       sent_index  col1
    0           0     i
    1           0  love
    2           0  cats
    3           1     i
    4           1  hate
    5           1  dogs
    
    1. Now you can merge col1 with df_dictionary, once for each of the two labels good and bad:
    for label in ["good", "bad"]:
        df_text = df_text.merge(df_dictionary[label], 
                                left_on="col1", 
                                right_on=label, 
                                how="left")
    

    Now df_text looks like this:

       sent_index  col1  good   bad
    0           0     i   NaN   NaN
    1           0  love  love   NaN
    2           0  cats   NaN   NaN
    3           1     i   NaN   NaN
    4           1  hate   NaN  hate
    5           1  dogs   NaN   NaN
    

    AFAICT, this should already contain all the information you need.

    1. Re-combine the words into sentences, using the sent_index we set earlier.
    df_final = (df_text.groupby("sent_index")
                .agg(list)
                .applymap(lambda s: ' '.join(w for w in s if not pd.isna(w)))
               )
    

    The final result then is:

                       col1  good   bad
    sent_index                         
    0           i love cats  love      
    1           i hate dogs        hate
    

    Note that in case of multiple matches, you'd get the labels as joined strings, too. E.g., I dislike dogs but don't hate them would occur as 'dislike hate' in the bad column. Whether or not that's alright depends on your next steps. Note that this is no problem for the data structure in step 2.