I am working with one large dataframe that contains multiple columns.
A cut-down version of the data is as follows (the ROW column has only been added to help demonstrate the issues I've run into, explained further below):
df1
ROW | search_words | text_to_search |
---|---|---|
1 | currency,CrossCurrencySwap,COPTIONS | text1 text2, "text3": TEXT4; Currency |
2 | Exotic Bond | text1, text 2 / text 3 'Exotic Bond' |
3 | JPN,Japan | Japan, Asia, JPN, EMEA, US |
4 | BasisSwap,basis swap,swap | EMEA |
5 | EMEA | US, Japan |
6 | irs,interestrateswap,interest rate swap | [NULL] |
7 | forward | FRAforwardrateagreementforwardbondtrade |
8 | equity swap, equityswap | equity, warrant, convertible, swap |
The goal is to filter the dataframe so that only the rows where at least one of the words in the search_words column appears in the text_to_search column are kept.
Once filtered, the dataframe should look like this:
df1
search_words | text_to_search |
---|---|
currency,CrossCurrencySwap,COPTIONS | text1 text2, "text3": TEXT4; Currency |
Exotic Bond | text1, text 2 / text 3 'Exotic Bond' |
JPN,Japan | Japan, Asia, JPN, EMEA, US |
forward | FRAforwardrateagreementforwardbondtrade |
As shown, the search_words column may contain one or more words on which to search. Where search_words contains a phrase (as in ROW 8, "equity swap"), the WHOLE phrase must exist in text_to_search for it to count as a match.
The text_to_search column has no discernible format, so I cannot split the text into a list of words to compare against each word from the search_words column.
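In other words, for each row I need per-row logic that behaves roughly like the sketch below (this is just my intent written out, not working code; 'row' and 'keep_row' are placeholder names for a single record and its result):

# Sketch of the per-row check I'm after ('row' is a placeholder for one record):
# keep the row if ANY of its own comma-separated search words/phrases appears,
# case-insensitively, as a plain substring of its own text
keep_row = any(
    word.strip().lower() in str(row["text_to_search"]).lower()
    for word in str(row["search_words"]).split(",")
)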
I started using the following line of code:
df1 = df1[df1.apply(lambda x: x.search_words in x['text_to_search'], axis=1)]
This returns no correct results.
I then tried reformatting the search_words column so that, instead of a ',' delimiter, the words are separated by a pipe '|', making the data look like this: currency|CrossCurrencySwap|COPTIONS (and so on...)
And using the following code to try to establish a match:
df1 = df1.loc[df1['text_to_search'].str.contains('|'.join(map(str, df1['search_words'])))]
This line of code doesn't work either. It joins all of the words in the entire search_words column into one combined pattern and then tests every line of text_to_search against that pattern, which produces false matches. For example, ROW 5 is returned as a match because 'Japan' appears in the joined string (it comes from ROW 3), even though 'Japan' is not in ROW 5's own search_words.
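To make the false matching concrete, here is a toy two-row version of what I believe is happening (made-up data and variable names, not my real dataframe):

import pandas as pd

# Toy data only, with search_words already reformatted to use '|' as the delimiter
toy = pd.DataFrame({
    "search_words": ["JPN|Japan", "EMEA"],
    "text_to_search": ["Japan, Asia, JPN, EMEA, US", "US, Japan"],
})

# A single pattern is built from the WHOLE column rather than per row: 'JPN|Japan|EMEA'
pattern = '|'.join(map(str, toy['search_words']))

# Every row is tested against every row's search words, so the second row is
# (wrongly) returned via 'Japan' even though its own search_words is only 'EMEA'
print(toy.loc[toy['text_to_search'].str.contains(pattern)])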
What I think I need is something like this:
df1 = df1[df1.apply(lambda x: words_in_string(zip(x.search_words, x.text_to_search)), axis=1)]

def words_in_string(search_words, text_to_search):
    word_count = 0
    for word in search_words:
        word_count = word_count + text_to_search.lower().find(word.lower())
    return (word_count > 0)
I'm certainly not going to win any prizes for this code. It's awful. And it doesn't work. Please could you point me to where I am going wrong and how to resolve this?
THANK YOU!!!
I would do it this way:
ser1 = df["search_words"].str.split("\s*,\s*")
ser2 = df["text_to_search"].str.lower()
m = [any(w.lower() in t for w in lst) for lst, t in zip(ser1, ser2)]
out = df.loc[m]
Output:
print(out)
search_words text_to_search
0 currency,CrossCurrencySwap,COPTIONS text1 text2, "text3": TEXT4; Currency
1 Exotic Bond text1, text 2 / text 3 'Exotic Bond'
2 JPN,Japan Japan, Asia, JPN, EMEA, US
6 forward FRAforwardrateagreementforwardbondtrade
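The key point is that the search words are split per row, so each row is only ever checked against its own words, and plain substring membership ('in') matches multi-word phrases such as 'equity swap' as a whole, with no regex escaping needed.

One caveat, in case your real text_to_search column holds genuine missing values (NaN) rather than the literal string "[NULL]": the membership test would then raise a TypeError on the float NaN. If so, a small tweak to the ser2 line (a guess at your data, adjust as needed) avoids it:

ser2 = df["text_to_search"].fillna("").str.lower()   # treat missing text as "no match"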