I am trying to remove non-English words from a text. The problem is that many valid words are also absent from the NLTK words corpus, so they get removed too.
My code:
import pandas as pd
lst = ['I have equipped my house with a new [xxx] HP203X climatisation unit']
df = pd.DataFrame(lst, columns=['Sentences'])
import nltk
nltk.download('words')
words = set(nltk.corpus.words.words())
df['Sentences'] = df['Sentences'].apply(lambda x: " ".join(w for w in nltk.wordpunct_tokenize(x) if w.lower() in (words)))
df
Input: I have equipped my house with a new [xxx] HP203X climatisation unit
Result: I have my house with a new unit
Should have been: I have equipped my house with a new climatisation unit
I can't figure out how to extend nltk.corpus.words.words() so that words like equipped and climatisation are not removed from the sentences.
You can use
words.update(['climatisation', 'equipped'])
Here, words is a set, which is why .extend(word_list) did not work: extend is a list method, while a set adds several elements at once with update.
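
A minimal sketch of the difference, using a small stand-in vocabulary instead of the full NLTK corpus so that it runs without a download. The names vocab, extra_words, and keep_known are hypothetical, and the regex tokenizer is only meant to approximate nltk.wordpunct_tokenize:

```python
import re

# Stand-in for set(nltk.corpus.words.words()); the real corpus is far larger.
vocab = {"i", "have", "my", "house", "with", "a", "new", "unit"}

# Hypothetical list of domain words that the corpus is missing.
extra_words = ["climatisation", "equipped"]

# Sets have no .extend(); that method belongs to lists.
try:
    vocab.extend(extra_words)
except AttributeError as err:
    print("extend failed:", err)

# .update() adds every element of an iterable to the set in place.
vocab.update(extra_words)


def keep_known(sentence, known):
    """Keep only the tokens whose lowercase form is in `known`."""
    # Assumed approximation of wordpunct_tokenize: runs of word
    # characters, or runs of punctuation.
    tokens = re.findall(r"\w+|[^\w\s]+", sentence)
    return " ".join(t for t in tokens if t.lower() in known)


result = keep_known(
    "I have equipped my house with a new [xxx] HP203X climatisation unit",
    vocab,
)
print(result)
```

With the real corpus, the same update call on words before the apply step should keep equipped and climatisation in the DataFrame column.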