
Identify non-English words in a column of a pandas DataFrame using WordNet


I have a column in a pandas DataFrame with millions of rows. Many entries are non-English words (e.g. words from other languages, or strings that do not mean anything, like "**5hjh"). I thought of using WordNet as a comprehensive English dictionary to help me clean up this column, which contains lists of words. Ideally, the output should be a new column with English words only.

I have tried the following code, which I got from Stack Overflow, but it does not seem to be working: it returns an empty column with no words whatsoever:

from nltk.corpus import wordnet

def check_for_word(s):
    return ' '.join(w for w in str(s).split(',') if len(wordnet.synsets(w)) > 0)

df["new_column"] = df["original_column"].apply(check_for_word)

Solution

  • The expression str(s).split(',') creates a list of strings in which every word except the first starts with a whitespace character (assuming str(s) worked as expected). When you then call wordnet.synsets(w), you look up a w whose first character is whitespace; WordNet has no such entry, so every lookup returns an empty synset list and nothing survives the filter.

    E.g. len(wordnet.synsets(' october')) will be zero.

    I recommend debugging to

    1. check that str(s) really produces a proper string, and
    2. make sure your w values are actual words (e.g. do not start with whitespace). If whitespace is the only issue, a simple fix is to call .strip() on each token (that is Python's equivalent of trim() in other languages).

    If you provide a df and a screenshot of your output for that df, it would be easier to pinpoint the issue.

    Update: additional points based on your comments above. Thank you, Fernanda. I've read your comments in the main thread. Here are a few more items you might find relevant:

    • WordNet contains relatively few adverbs, so with this approach you might lose some valid adverbs
    • the synset counting is a bit slow. Instead of len(wordnet.synsets(word)) > 0, I'd rely on the truthiness of the returned list:

    if wordnet.synsets(word):

    Maybe it will be faster

    • be careful with the idea of counting word occurrences: a large proportion of perfectly valid words is rare (appearing only once in the corpus, even for large corpora). This is related to Zipf's law.
    • consider a regular-expression-based method to filter out words that contain unusual characters
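Such a regex pre-filter might look like the sketch below. The pattern is an assumption (ASCII letters, with optional internal apostrophes or hyphens); adjust it to whatever counts as "usual" characters in your data.

```python
import re

# assumed pattern: one or more letter runs, optionally joined by ' or -
WORD_RE = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")

def looks_like_word(token):
    # reject tokens containing digits, symbols, or stray punctuation
    return bool(WORD_RE.match(token.strip()))
```

Running this cheap check before the WordNet lookup discards tokens like "**5hjh" without touching the corpus at all.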