Search code examples
pandasdataframelambdazipiterable

Python filter dataframe by iterating through list of words in one column and returning rows where any word from list can be found in another column


I am working with one large dataframe which contains multiple columns

Cutdown version of the data is as follows (the ROW column has only been inserted to demonstrate issues I've experienced, explained further below)

df1

ROW search_words text_to_search
1 currency,CrossCurrencySwap,COPTIONS text1 text2, "text3": TEXT4; Currency
2 Exotic Bond text1, text 2 / text 3 'Exotic Bond'
3 JPN,Japan Japan, Asia, JPN, EMEA, US
4 BasisSwap,basis swap,swap EMEA
5 EMEA US, Japan
6 irs,interestrateswap,interest rate swap [NULL]
7 forward FRAforwardrateagreementforwardbondtrade
8 equity swap, equityswap equity, warrant, convertible, swap

The goal is to filter dataframe so only those rows where words in the search_words column that exist in text_to_search column are contained within it

Once filtered, the dataframe should look like this:

df1

search_words text_to_search
currency,CrossCurrencySwap,COPTIONS text1 text2, "text3": TEXT4; Currency
Exotic Bond text1, text 2 / text 3 'Exotic Bond'
JPN,Japan Japan, Asia, JPN, EMEA, US
forward FRAforwardrateagreementforwardbondtrade

As shown, the search_words column may contain one or more word on which to search. Where search_words contains a phrase (as in ROW 8 "equity swap") the WHOLE phrase must exist in text_to_search in order to find a match.

The text_to_search column has no discernible format so I cannot split the text into a list of words to use to compare against each word from the search_words column.

I started using the following line of code:

df1 = df1[df1.apply(lambda x: x.search_words in x['text_to_search'], axis=1)]

This returns no correct results.

I then tried reformatting the search_words column so that instead of a ',' delimiter, the words are separated by a pipe '|' so that the data looks like this: currency|CrossCurrencySwap|COPTIONS (and so on...)

And using the following code to try to establish a match:

df1 = df1.loc[df1['text_to_search'].str.contains('|'.join(map(str, df1['search_words']))))]

This line of code doesn't work either. It appears to join all words in the search_words column together and match all search_words to each line of text in the text_to_search column. In this way, I am getting false matches. For example ROW 5 is returned as a match because 'Japan' is found in the joined search_words string even though 'Japan' is not found in the search_words column in ROW 5.

What I think I need is something like this:

df1 = df1[df1.apply(lambda x: words_in_string(zip(x.search_words, x.text_to_search)), axis=1)]

def words_in_string(search_words, text_to_search): word_count = 0

for word in search_words: word_count = word_count + text_to_search.lower().find(word.lower())

return (word_count > 0)

I'm certainly not going to win any prizes for this code. It's awful. And it doesn't work. Please could you point me to where I am going wrong and how to resolve this?

THANK YOU!!!


Solution

  • I would do it this way :

    ser1 = df["search_words"].str.split("\s*,\s*")
    ser2 = df["text_to_search"].str.lower()
    ​
    m = [any(w.lower() in t for w in lst) for lst, t in zip(ser1, ser2)]
    ​
    out = df.loc[m]
    

    ​ Output :

    print(out)
                              search_words                           text_to_search
    0  currency,CrossCurrencySwap,COPTIONS    text1 text2, "text3": TEXT4; Currency
    1                          Exotic Bond     text1, text 2 / text 3 'Exotic Bond'
    2                            JPN,Japan               Japan, Asia, JPN, EMEA, US
    6                              forward  FRAforwardrateagreementforwardbondtrade