Tags: python, pandas, nltk, data-analysis, stop-words

Unhashable type: 'list' error for stopwords


Here is my code.

URL to CSV file: https://github.com/eugeneketeni/web-mining-final-project/blob/master/Test_file.csv

import pandas as pd

data = pd.read_csv("https://raw.githubusercontent.com/eugeneketeni/web-mining-final-project/master/Test_file.csv")

import nltk
from nltk import word_tokenize, sent_tokenize


data['text'] = data.loc[:, 'text'].astype(str)

text = data.loc[:, "text"].astype(str)
tokenizer = [word_tokenize(text[i]) for i in range(len(text))]
print(tokenizer)

filtered_sentence = []


from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))

filtered_sentence = [w for w in tokenizer if not w in stopwords]
print(filtered_sentence) 

My tokenizer works, but when I try to remove the default stopwords I keep getting an "unhashable type: 'list'" error. I am not sure what is really going on. I would appreciate any help. Thanks.


Solution

  • TL;DR

    from nltk import word_tokenize
    from nltk.corpus import stopwords
    
    import pandas as pd
    
    stoplist = set(stopwords.words('english'))
    
    data = pd.read_csv("Test_file.csv")
    
    data['filtered_text'] = data['text'].astype(str).apply(lambda line: [token for token in word_tokenize(line) if token not in stoplist])
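
For context on the error itself: in the question, `tokenizer` holds one token list per row, so each `w` in the comprehension is a whole list rather than a word. Checking membership in a set hashes the value, and lists are not hashable. The sketch below reproduces that with stand-in data; the tiny stop word set and the whitespace split are assumptions made so it runs without NLTK data, not the real NLTK stop list or tokenizer.

```python
# Stand-in data: a tiny stop word set and whitespace tokenization are
# assumptions made so the sketch runs without downloading NLTK data.
stoplist = {"the", "is", "a"}
texts = ["the cat is here", "a dog barks"]

# One token list per row, so `tokenized` is a list of lists.
tokenized = [text.split() for text in texts]

try:
    # Each `w` is a whole list; set membership hashes it and fails.
    [w for w in tokenized if w not in stoplist]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Filtering inside each row compares individual strings instead.
filtered = [[w for w in row if w not in stoplist] for row in tokenized]

print(error_message)
print(filtered)
```

The `apply` call in the TL;DR does the same per-row filtering, with the inner comprehension running once for each row of the `text` column.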
    

    In Long

    Please see "Why is my NLTK function slow when processing the DataFrame?" for a more detailed explanation of:

    • tokenizing text in a dataframe
    • removing stopwords
    • other related cleaning processes

    For better processing of Twitter text, install NLTK with its Twitter extras:

    pip3 install -U nltk[twitter]
    

    Then use this:

    from nltk.corpus import stopwords

    from nltk.tokenize import TweetTokenizer
    
    import pandas as pd
    
    word_tokenize = TweetTokenizer().tokenize
    
    stoplist = set(stopwords.words('english'))
    
    data = pd.read_csv("Test_file.csv")
    
    data['filtered_text'] = data['text'].astype(str).apply(lambda line: [token for token in word_tokenize(line) if token not in stoplist])
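
One caveat that may apply to either snippet: the entries in NLTK's English stop word list are lowercase, so a capitalized token such as "The" would not match and would stay in the output. A minimal sketch of one way to handle that, assuming a stand-in stop list and hand-written tokens (both hypothetical, chosen so no NLTK data is needed):

```python
# Stand-in stop list and tokens (assumptions, so no NLTK data is needed).
stoplist = {"the", "is", "a"}
tokens = ["The", "weather", "is", "great", "The", "END"]

# Case-sensitive check: capitalized "The" slips through.
case_sensitive = [t for t in tokens if t not in stoplist]

# Compare the lowercased form, but keep the original token in the result.
case_insensitive = [t for t in tokens if t.lower() not in stoplist]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing only for the comparison keeps the original casing in `filtered_text`, which can matter for tweets where capitalization carries emphasis.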