python regex pandas findall

Finding specific words in a column


I'm trying to use the findall function to find four specific words in the strings in a column of a DataFrame.

import pandas as pd

df = pd.DataFrame({'case':('Case1','Case2','Case3','Case4'),
                   'text':('good boy', 'bad girl', 'yoghurt', 'good girl yoghurt')})
    case    text
0   Case1   good boy
1   Case2   bad girl
2   Case3   yoghurt
3   Case4   good girl yoghurt

Let's say I want to find 'good' and 'yoghurt', producing a list that for this dataset would be ['good', '', 'yoghurt', 'good, yoghurt']: an empty string (or None) where nothing matches, and both words when they occur in the same row. I then want to create a new column from it, which is why it's important that I get an entry for every row, even if it's empty.

Most findall examples involve regex symbols, whereas I'm trying to feed it a list of words.


Solution

  • You can use str.findall with the | regex operator (meaning "or"):

    df['new_column'] = df.text.str.findall('good|yoghurt')
    >>> df
        case               text       new_column
    0  Case1           good boy           [good]
    1  Case2           bad girl               []
    2  Case3            yoghurt        [yoghurt]
    3  Case4  good girl yoghurt  [good, yoghurt]
    

    If you want the words joined by a comma, in the way your question suggests, you can then apply ', '.join:

    df['new_column'] = df.text.str.findall('good|yoghurt').apply(', '.join)
    >>> df
        case               text     new_column
    0  Case1           good boy           good
    1  Case2           bad girl               
    2  Case3            yoghurt        yoghurt
    3  Case4  good girl yoghurt  good, yoghurt
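
Since the question asks about feeding findall a list of words rather than a hand-written pattern, here is a minimal sketch of one way to build the pattern from a list. The `words` list and the `pattern` variable are illustrative names, not part of the original answer, and the `\b` word boundaries are an added assumption: they keep 'good' from matching inside a longer word such as 'goodness', so drop them if substring matches are wanted.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'case': ['Case1', 'Case2', 'Case3', 'Case4'],
    'text': ['good boy', 'bad girl', 'yoghurt', 'good girl yoghurt'],
})

# Hypothetical word list; replace with the words you are looking for.
words = ['good', 'yoghurt']

# Escape each word so any regex metacharacters are matched literally,
# join the words with "|" (regex "or"), and wrap the alternation in \b
# so that only whole words match.
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'

# findall returns one list per row (empty when nothing matches);
# joining each list yields an empty string for rows without a match.
df['new_column'] = df['text'].str.findall(pattern).apply(', '.join)

result = df['new_column'].tolist()
print(result)
```

The non-capturing group `(?:...)` is used so that findall returns the whole match rather than the contents of a group, and every row keeps an entry even when no word is found.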