I'm trying to use the findall function to find 4 specific words in the strings in a column of a DataFrame.
import pandas as pd
df = pd.DataFrame({'case': ('Case1', 'Case2', 'Case3', 'Case4'),
                   'text': ('good boy', 'bad girl', 'yoghurt', 'good girl yoghurt')})
    case               text
0  Case1           good boy
1  Case2           bad girl
2  Case3            yoghurt
3  Case4  good girl yoghurt
Let's say I want to find 'good' and 'yoghurt' and build a list that, for this dataset, would be ['good', '', 'yoghurt', 'good, yoghurt']: an empty string (or None) when nothing matches, and both words when they occur in the same row. I want to create a new column from the result, so it's important that every row gets an entry, even an empty one.
Most findall examples involve regex symbols, whereas I'm trying to feed it a list of words.
You can use str.findall with the | regex operator (meaning "or"):
df['new_column'] = df.text.str.findall('good|yoghurt')
>>> df
    case               text       new_column
0  Case1           good boy           [good]
1  Case2           bad girl               []
2  Case3            yoghurt        [yoghurt]
3  Case4  good girl yoghurt  [good, yoghurt]
If you want the words joined by a comma, in the way your question suggests, you can then apply ', '.join:
df['new_column'] = df.text.str.findall('good|yoghurt').apply(', '.join)
>>> df
    case               text     new_column
0  Case1           good boy           good
1  Case2           bad girl
2  Case3            yoghurt        yoghurt
3  Case4  good girl yoghurt  good, yoghurt
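Since the question is about starting from a list of words rather than a hand-written pattern, here is a minimal sketch of building the pattern from a list. The name `words` is a placeholder for illustration, and `re.escape` is an added precaution in case a word contains regex metacharacters; `.str.join` is used here as an alternative to `.apply(', '.join)`.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'case': ['Case1', 'Case2', 'Case3', 'Case4'],
    'text': ['good boy', 'bad girl', 'yoghurt', 'good girl yoghurt'],
})

# Hypothetical word list; re.escape guards against regex metacharacters.
words = ['good', 'yoghurt']
pattern = '|'.join(re.escape(word) for word in words)

# One list of matches per row (an empty list when nothing matches) ...
matches = df['text'].str.findall(pattern)

# ... then one comma-separated string per row (an empty string for no match).
df['new_column'] = matches.str.join(', ')
```

Note that this pattern also matches inside longer words (for example, 'good' inside 'goodness'). If only whole words should match, one option is to wrap the alternation in word boundaries, as in r'\b(?:' + pattern + r')\b'.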