I have a pandas DataFrame that contains a column of several thousand comments. I would like to iterate through every row in the column, check whether the comment contains any word from a list of words I've created, and, if it does, label it as such in a separate column. This is what I have so far:
retirement_words_list = ['match','matching','401k','retirement','retire','rsu','rrsp']
def word_checker(row):
    for sentence in df['comments']:
        if any(word in re.findall(r'\w+', sentence.lower()) for word in retirement_words_list):
            return '401k/Retirement'
        else:
            return 'Other'
df['topic'] = df.apply(word_checker,axis=1)
The code is labeling every single comment in my dataframe as 'Other' even though I have double-checked that many comments contain one or several of the words from my list. Any ideas for how I may correct my code? I'd greatly appreciate your help.
It is probably more convenient to build a set version of retirement_words_list (for efficient inclusion testing) and then loop over the words in the sentence, checking each one for membership in this set, rather than the other way round:
retirement_words_list = ['match','matching','401k','retirement','retire','rsu','rrsp']
retirement_words_set = set(retirement_words_list)
and then
if any(word in retirement_words_set for word in sentence.lower().split()):
    # .... etc ....
Given your list, you must be looking for whole-word matches rather than substring matches, or it wouldn't make sense to include 'matching' and 'retirement' on the list when 'match' and 'retire' are already included. Hence the split into words (your re.findall(r'\w+', ...) does the same job and also strips punctuation), which is also why we can reverse the logic and test each word of the sentence against the set.
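To make the whole-word point concrete, here is a minimal sketch contrasting a substring test with a token-based test. The function names contains_substring and contains_whole_word are made up for illustration, and the sample sentences are invented:

```python
import re

retirement_words_set = {'match', 'matching', '401k', 'retirement', 'retire', 'rsu', 'rrsp'}

def contains_substring(sentence):
    # Substring test: 'match' is also found inside 'mismatched'.
    lowered = sentence.lower()
    return any(word in lowered for word in retirement_words_set)

def contains_whole_word(sentence):
    # Whole-word test: tokenize first, then check each token against the set.
    tokens = re.findall(r'\w+', sentence.lower())
    return any(token in retirement_words_set for token in tokens)

print(contains_substring('The socks were mismatched'))      # True
print(contains_whole_word('The socks were mismatched'))     # False
print(contains_whole_word('Please raise the 401k match.'))  # True
```

Note that a plain split() would leave the last token as 'match.' with its trailing period, which is not in the set; tokenizing with re.findall(r'\w+', ...) avoids that.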
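NOTE: You will need some further changes, because your function word_checker has a parameter called row which it does not use. Instead it loops over the whole df['comments'] column and returns during the first iteration, so every row receives the label of the first comment. Possibly what you meant to do was something like: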
def word_checker(sentence):
    if any(word in retirement_words_set for word in sentence.lower().split()):
        return '401k/Retirement'
    else:
        return 'Other'
and:
df['topic'] = df['comments'].apply(word_checker)
where sentence is the contents of each row from the comments column.
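Putting the pieces together, a self-contained sketch might look like the following. The sample comments are invented for illustration, and the str(...) call is an added assumption so that missing values (NaN) fall through to 'Other' instead of raising an error:

```python
import re
import pandas as pd

retirement_words_set = {'match', 'matching', '401k', 'retirement', 'retire', 'rsu', 'rrsp'}

def word_checker(sentence):
    # str() guards against non-string values such as NaN (an assumption,
    # not something the original data is known to contain).
    tokens = re.findall(r'\w+', str(sentence).lower())
    if any(token in retirement_words_set for token in tokens):
        return '401k/Retirement'
    return 'Other'

# Hypothetical sample data standing in for the real comments column.
df = pd.DataFrame({'comments': [
    'I wish the 401k match were higher.',
    'The cafeteria food is great',
    'When do my RSU grants vest?',
]})

# Series.apply passes each comment to word_checker one at a time.
df['topic'] = df['comments'].apply(word_checker)
print(df)
```

Because apply is called on the comments Series rather than on the whole DataFrame, no axis argument is needed and the function receives a single comment per call.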