
Identify non-English words in a column of a pandas DataFrame using WordNet


I have a column in a pandas DataFrame with millions of rows. Many entries are non-English words (e.g. words from other languages, or strings that do not mean anything, like "**5hjh"). I thought of using WordNet as a comprehensive English dictionary to help me clean up this column, which contains lists of words. Ideally, the output should be a new column with English words only.

I have tried the following code, which I got from Stack Overflow, but it does not seem to be working: it returns an empty column with no words whatsoever:

from nltk.corpus import wordnet

def check_for_word(s):
    return ' '.join(w for w in str(s).split(',') if len(wordnet.synsets(w)) > 0)

df["new_column"] = df["original_column"].apply(check_for_word)

Solution

  • The expression str(s).split(',') creates a list of strings in which every word except the first starts with a whitespace character (assuming str(s) worked as expected). When you then call wordnet.synsets(w), you look up a w whose first character is whitespace; WordNet has no such entry, so every lookup returns an empty synset list and nothing survives the filter.

    E.g. len(wordnet.synsets(' october')) will be zero.

    I recommend debugging to

    1. check that str(s) really produces a proper string, and
    2. make sure your w values are actual words (e.g. do not start with whitespace). If whitespace is the only issue, a simple fix is to call .strip() on each token (that is Python's equivalent of trim() in other languages).

    If you provide a df and a screenshot of your output for that df, it would be easier to pinpoint the issue.

    Update: additional points based on your comments above. Thank you, Fernanda. I've read your comments in the main thread. Here are a few more items you might find relevant:

    • WordNet contains relatively few adverbs, so with this approach you might lose some valid adverbs
    • the synset counting is a bit slow. Instead of len(wordnet.synsets(word)) > 0, I'd rely on the truthiness of the returned list:

    if wordnet.synsets(word):

    Maybe it will be faster

    • be careful with the idea of counting word occurrences: a large proportion of perfectly valid words is rare (appearing only once in the corpus, even for large corpora). This is related to Zipf's law.
    • consider a regular-expression-based method to filter out words that contain unusual characters
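Such a regex pre-filter might look like the sketch below. The pattern is an assumption (ASCII letters, with optional internal apostrophes or hyphens); adjust it to whatever counts as "usual" characters in your data.

```python
import re

# assumed pattern: one or more letter runs, optionally joined by ' or -
WORD_RE = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")

def looks_like_word(token):
    # reject tokens containing digits, symbols, or stray punctuation
    return bool(WORD_RE.match(token.strip()))
```

Running this cheap check before the WordNet lookup discards tokens like "**5hjh" without touching the corpus at all.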