Checking if any word in a string appears in a list using python


I have a pandas dataframe that contains a column of several thousand comments. I would like to iterate through every row in the column and check whether the comment contains any word from a list of words I've created; if it does, I want to label it as such in a separate column. This is what I have so far in my code:

retirement_words_list = ['match','matching','401k','retirement','retire','rsu','rrsp']

def word_checker(row):
    for sentence in df['comments']: 
        if any(word in re.findall(r'\w+', sentence.lower()) for word in retirement_words_list):
            return '401k/Retirement'
        else:
            return 'Other'

df['topic'] = df.apply(word_checker,axis=1)    

The code is labeling every single comment in my dataframe as 'Other' even though I have double-checked that many comments contain one or several of the words from my list. Any ideas for how I may correct my code? I'd greatly appreciate your help.


Solution

  • Probably more convenient to have a set version of retirement_words_list (for efficient inclusion testing) and then loop over the words in the sentence, checking inclusion in this set, rather than the other way round:

    retirement_words_list = ['match','matching','401k','retirement','retire','rsu','rrsp']
    
    retirement_words_set = set(retirement_words_list)
    

    and then

        if any(word in retirement_words_set for word in sentence.lower().split()):
            # .... etc ....
    

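    Your code's membership test already matches whole words, because re.findall(r'\w+', ...) returns a list of tokens rather than a single string. Whole-word matching must be what you want: it wouldn't make sense to include 'matching' and 'retirement' on the list if 'match' and 'retire' were being matched as substrings. That is also why the logic can be reversed to test each word of the sentence against the set. Note that split() only separates on whitespace, so a word followed by punctuation (such as 'retire,') will not match unless you keep your re.findall tokenization.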

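    NOTE: The reason every comment gets the same label is that your function word_checker has a parameter called row which it never uses. Instead it loops over the entire df['comments'] column and returns during the first iteration, so every row receives the label of the first comment. Possibly what you meant to do was something like: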

    def word_checker(sentence):
        if any(word in retirement_words_set for word in sentence.lower().split()):
            return '401k/Retirement'
        else:
            return 'Other'
    

    and:

    df['topic'] = df['comments'].apply(word_checker)
    

    where sentence is the contents of each row from the comments column.
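
Putting the pieces together, here is a minimal sketch of how the per-comment function might look. It keeps the \w+ tokenization from the question so that punctuation does not block a match. The isinstance guard for missing values and the sample comments are assumptions added for illustration and are not part of the original code.

```python
import re

# Same word list as in the question, stored as a set for fast membership tests.
retirement_words_set = {'match', 'matching', '401k', 'retirement', 'retire', 'rsu', 'rrsp'}

def word_checker(sentence):
    """Label a single comment as '401k/Retirement' or 'Other'."""
    # Assumption: non-string cells (e.g. NaN for an empty comment) count as 'Other'.
    if not isinstance(sentence, str):
        return 'Other'
    # \w+ splits on any non-word character, so 'retire,' yields the token 'retire'.
    words = re.findall(r'\w+', sentence.lower())
    if any(word in retirement_words_set for word in words):
        return '401k/Retirement'
    return 'Other'

# Hypothetical sample comments, not taken from the asker's data.
sample_comments = [
    'Does the company MATCH contributions?',
    'I love the free snacks.',
    'When can I retire, realistically?',
    'The matchmaking event was fun.',
]
labels = [word_checker(comment) for comment in sample_comments]
print(labels)
```

With the asker's dataframe, the same function would be applied per cell as df['topic'] = df['comments'].apply(word_checker). The last sample shows the whole-word behaviour: 'matchmaking' contains 'match' as a substring but is not labelled, because only complete tokens are compared against the set.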