I want to remove nonsense words from my dataset.
I tried something like this, which I saw on StackOverflow:
import nltk
words = set(nltk.corpus.words.words())
sent = "Io andiamo to the beach with my amico."
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
if w.lower() in words or not w.isalpha())
But now that I have a DataFrame, how do I apply this to a whole column?
I tried something like this:
import nltk
words = set(nltk.corpus.words.words())
sent = df['Chats']
df['Chats'] = df['Chats'].apply(lambda w:" ".join(w for w in
nltk.wordpunct_tokenize(sent) \
if w.lower() in words or not w.isalpha()))
But I am getting an error: TypeError: expected string or bytes-like object
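The error occurs because the lambda passes sent (the whole Chats column) to the tokenizer instead of the single string it receives as its argument.
Something like the following will generate a column Clean by applying your function to each value in the column Chats: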
import nltk
words = set(nltk.corpus.words.words())
def clean_sent(sent):
    # Keep tokens that are known English words or are not purely alphabetic.
    return " ".join(w for w in nltk.wordpunct_tokenize(sent)
                    if w.lower() in words or not w.isalpha())
df['Clean'] = df['Chats'].apply(clean_sent)
To update the Chats column itself, you can overwrite it using the original column:
df['Chats'] = df['Chats'].apply(clean_sent)
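The same TypeError can also appear if a cell in the column is not a string (for example NaN for a missing chat), since the tokenizer only accepts strings. Below is a minimal sketch of a guard for that case. It assumes a small stand-in vocabulary (VOCAB is hypothetical) and a plain regex that mimics a word/punctuation tokenizer, so it runs without downloading the NLTK corpus; in real use you would keep nltk.wordpunct_tokenize and the full word set.

```python
import re

# Hypothetical stand-in vocabulary; real code would use
# set(nltk.corpus.words.words()) instead.
VOCAB = {"to", "the", "beach", "with", "my"}

def clean_sent(sent, vocab=VOCAB):
    # Non-string cells (NaN, None, numbers) would make a tokenizer raise
    # TypeError, so return them unchanged.
    if not isinstance(sent, str):
        return sent
    # Split into runs of word characters or runs of punctuation
    # (an assumption standing in for nltk.wordpunct_tokenize).
    tokens = re.findall(r"\w+|[^\w\s]+", sent)
    # Keep known words and anything that is not purely alphabetic.
    return " ".join(t for t in tokens if t.lower() in vocab or not t.isalpha())

chats = ["Io andiamo to the beach with my amico.", float("nan"), None]
cleaned = [clean_sent(c) for c in chats]
print(cleaned[0])  # to the beach with my .
```

A function written this way can be passed to df['Chats'].apply(...) exactly as above; missing values pass through untouched instead of raising.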