I have a column in a pandas DataFrame with millions of rows. Many of the words are not English (e.g. words from other languages, or strings that do not mean anything, like "**5hjh"). I thought of using WordNet as a comprehensive English dictionary to help me clean up this column, whose cells contain lists of words. Ideally, the output should be a new column with English words only.
I have tried the following code, which I got from Stack Overflow, but it does not seem to work: it returns an empty column with no words whatsoever:
from nltk.corpus import wordnet

def check_for_word(s):
    return ' '.join(w for w in str(s).split(',') if len(wordnet.synsets(w)) > 0)

df["new_column"] = df["original_column"].apply(check_for_word)
The expression str(s).split(',') creates a list of strings in which every word except the first starts with a whitespace character (assuming str(s) worked as expected). When you then call wordnet.synsets(w), you look up a token whose first character is that whitespace; WordNet has no such entry, so every synset list has length 0.
E.g. len(wordnet.synsets(' october')) will be zero.
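For illustration, here is a minimal sketch of how stripping that whitespace could look, assuming the cells really are comma-separated strings; the column names are taken from your snippet:

from nltk.corpus import wordnet

def check_for_word(s):
    # strip surrounding whitespace from each token before the WordNet lookup
    words = (w.strip() for w in str(s).split(','))
    return ' '.join(w for w in words if wordnet.synsets(w))

df["new_column"] = df["original_column"].apply(check_for_word)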
I recommend debugging to confirm this, e.g. by printing the list that str(s).split(',') produces for one sample cell.
If you provide a df and a screenshot of your output for that df, it would be easier to pinpoint the issue.
Update: additional points based on your comments. Thank you, Fernanda. I've read your comments in the main thread; here are a few more items you might find relevant:
You do not need the len(...) > 0 comparison: synsets() returns a list, and an empty list is falsy, so you can use the if wordnet.synsets(word): syntax. Maybe it will be faster.
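A small sketch of that check (is_english is just an illustrative helper name, not something from your code):

from nltk.corpus import wordnet

def is_english(word):
    # an empty synset list is falsy, so no explicit length comparison is needed
    return bool(wordnet.synsets(word))

With millions of rows you could also consider memoizing such a helper (e.g. with functools.lru_cache), since the same words are likely to recur; whether that actually pays off depends on your data.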