NLTK: Eliminating stop words from a list of lists

I am trying to remove stop words and tried the following:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenized = data['data_column'].apply(tokenizer.tokenize)

Below is the output after tokenization:

Name: data_column, dtype: object

I tried to remove stop words using the code below:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in tokenized if not w in stop_words]
filtered_sentence = []
for w in tokenized:
    if w not in stop_words:
        filtered_sentence.append(w)

I get this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-272-d4a699384ffc> in <module>()
      2 stop_words = set(stopwords.words('english'))
----> 4 filtered_sentence = [w for w in tokenized if not w in stop_words]
      6 filtered_sentence = []

TypeError: unhashable type: 'list'


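  • Each element of tokenized is itself a list of tokens, so iterating over the Series tests a whole list against the set, and lists are unhashable, hence the TypeError. Use .apply() to filter the tokens inside each row. The NLTK stop word list is all lowercase, so call .lower() on each token before the lookup, i.e.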

    stop_words = set(stopwords.words('english'))
    filtered_sentence = tokenized.apply(lambda x : [w for w in x if w.lower() not in stop_words])  

    Sample run

    import pandas as pd
    from nltk.corpus import stopwords

    stop = set(stopwords.words('english'))
    df = pd.DataFrame({'words': [['A','SAMPLE','AS','OUTPUT','MSG']]})
    df['words'].apply(lambda x : [i for i in x if i.lower() not in stop])
    0    [SAMPLE, OUTPUT, MSG]
    Name: words, dtype: object
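
To see the cause of the error in isolation, here is a minimal sketch in plain Python that does not need pandas or the NLTK corpus. The small stop_words set is a hypothetical stand-in for set(stopwords.words('english')), and tokenized_rows, reproduce_error and remove_stop_words are names made up for this illustration. It reproduces the TypeError from the flat comprehension and then applies the per-row filter that the .apply() call above performs.

```python
# Hypothetical stand-in for set(stopwords.words('english')),
# so the sketch runs without downloading the NLTK corpus.
stop_words = {"a", "as", "the", "is"}

# Each row is a list of tokens, like one element of the tokenized Series.
tokenized_rows = [
    ["A", "SAMPLE", "AS", "OUTPUT", "MSG"],
    ["the", "cat", "is", "here"],
]


def reproduce_error(rows, stops):
    """Flat comprehension from the question: tests a whole list against the set."""
    try:
        return [w for w in rows if w not in stops]
    except TypeError as exc:
        # Set membership hashes its operand, and a list cannot be hashed.
        return str(exc)


def remove_stop_words(rows, stops):
    """Filter inside each row, lowercasing every token before the lookup."""
    return [[w for w in row if w.lower() not in stops] for row in rows]


error_message = reproduce_error(tokenized_rows, stop_words)
filtered_rows = remove_stop_words(tokenized_rows, stop_words)

print(error_message)
print(filtered_rows)  # [['SAMPLE', 'OUTPUT', 'MSG'], ['cat', 'here']]
```

The nested comprehension mirrors the lambda passed to .apply(): the outer loop walks the rows and the inner loop walks the tokens, so only strings are ever looked up in the set.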