python machine-learning nlp nltk

Removing nonsense words in Python


I want to remove nonsense words in my dataset.

I tried something like this, which I saw on Stack Overflow:

import nltk
words = set(nltk.corpus.words.words())

sent = "Io andiamo to the beach with my amico."
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
     if w.lower() in words or not w.isalpha())

But my data is in a DataFrame. How do I apply this to every row of a column?

I tried something like this:

import nltk
words = set(nltk.corpus.words.words())

sent = df['Chats']
df['Chats'] = df['Chats'].apply(lambda w:" ".join(w for w in 
nltk.wordpunct_tokenize(sent) \
     if w.lower() in words or not w.isalpha()))

But I am getting this error: TypeError: expected string or bytes-like object


Solution

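  • The error occurs because the lambda passes sent (the entire Chats column) to the tokenizer instead of the single row value it receives. Wrapping the logic in a function that takes one string and applying it to the column avoids this. Something like the following creates a new column Clean from the column Chats: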

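    import nltk

    # Requires the corpus once: nltk.download('words')
    words = set(nltk.corpus.words.words())

    def clean_sent(sent):
        # Keep tokens that are English words or are not purely alphabetic
        return " ".join(w for w in nltk.wordpunct_tokenize(sent)
                        if w.lower() in words or not w.isalpha())

    df['Clean'] = df['Chats'].apply(clean_sent)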
    

    To update the Chats column itself instead, assign the result back to it:

    df['Chats'] = df['Chats'].apply(clean_sent)
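
The same TypeError can also appear if some rows are not strings, for example missing values (None or NaN). Below is a minimal sketch of a per-row function that guards against that. To keep it runnable without NLTK or pandas, it makes two assumptions that are not part of the answer above: VOCAB is a tiny hypothetical stand-in for set(nltk.corpus.words.words()), and TOKEN_RE is a regular expression meant to mimic nltk.wordpunct_tokenize.

```python
import re

# Hypothetical stand-in vocabulary (assumption); with NLTK this would be
# set(nltk.corpus.words.words()).
VOCAB = {"to", "the", "beach", "with", "my"}

# Assumed approximation of nltk.wordpunct_tokenize: runs of word
# characters, or runs of punctuation.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def clean_sent(sent, vocab=VOCAB):
    """Keep tokens that are in vocab or are not purely alphabetic."""
    if not isinstance(sent, str):
        # Missing values (None, NaN) would raise TypeError in the tokenizer,
        # so return an empty string for them instead.
        return ""
    tokens = TOKEN_RE.findall(sent)
    return " ".join(w for w in tokens if w.lower() in vocab or not w.isalpha())


# Example rows, including two non-string values.
chats = ["Io andiamo to the beach with my amico.", None, float("nan")]
cleaned = [clean_sent(c) for c in chats]
print(cleaned)
```

With a DataFrame, a function like this could be passed to df['Chats'].apply(...) in the same way as above. Whether empty strings are the right replacement for missing rows depends on the dataset.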