I have two dataframes. I need to check whether each string in the second df contains a substring from the first df, and get a list of the words that occur in the second df.
First df (word):
word |
---|
apples |
dog |
cat |
cheese |
Second df (sentence):
sentence |
---|
apples grow on a tree |
... |
I love cheese |
I tried this one:
tru=[]
for i in word['word']:
    if i in sentence['sentence'].values:
        tru.append(i)
And this one:
tru=[]
for i in word['word']:
    if sentence['sentence'].str.contains(i):
        tru.append(i)
I expect to get a list like ['apples',..., 'cheese']
One possible way is to use Series.str.extractall:
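import re
import pandas as pd

df_word = pd.Series(["apples", "dog", "cat", "cheese"])
df_sentence = pd.Series(["apples grow on a tree", "i love cheese"])

# Build one alternation pattern, escaping any regex metacharacters in the words
pattern = f"({'|'.join(df_word.apply(re.escape))})"
# One row per match, indexed by (sentence index, match number)
matches = df_sentence.str.extractall(pattern)
matches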
Output:
              0
  match
0 0      apples
1 0      cheese
You can then convert the results to a list:
matches[0].unique().tolist()
Output:
['apples', 'cheese']
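For comparison, the second attempt in the question can probably be made to work with a small change. `str.contains` returns a boolean Series (one value per sentence), which cannot be used directly in an `if`; reducing it with `.any()` gives a single True/False. The sketch below assumes the two dataframes are named `word` and `sentence` as in the question, and uses only the two sentence rows that are shown there:

```python
import pandas as pd

# Sample data rebuilt from the question (only the rows shown there)
word = pd.DataFrame({"word": ["apples", "dog", "cat", "cheese"]})
sentence = pd.DataFrame({"sentence": ["apples grow on a tree", "I love cheese"]})

# Keep each word that occurs as a substring of at least one sentence.
# regex=False treats the word as literal text, so no escaping is needed.
found = [
    w
    for w in word["word"]
    if sentence["sentence"].str.contains(w, regex=False).any()
]
print(found)  # ['apples', 'cheese']
```

Note that both this and the `extractall` approach match substrings, so a word like "cat" would also be found inside "concatenate". This version also scans the sentences once per word, so the single-pattern `extractall` approach may be preferable for long word lists.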