
Error of TfidfVectorizer on cleaned text dataset


I am trying to vectorize a sentiment data set. It contains review text and a sentiment label for each review. When I try to vectorize the data set, I get the error 'LazyCorpusLoader' object is not iterable.

The reviews were cleaned as follows.

  • remove html tags
  • tokenize text to remove punctuation
  • remove stop words
  • POS tagging
  • lemmatize text

After these steps, my dataframe reviewDataset_Df has the following columns:

  1. review_clean -> cleaned review text
  2. SENTIMENT -> a sentiment label, either positive or negative

Then I split the data set using the code below:

# Split the first 10,000 rows into training and test sets
subset = reviewDataset_Df.head(10000)
X_train, X_test, Y_train, Y_test = train_test_split(
    subset.review_clean, subset.SENTIMENT,
    test_size=0.20, random_state=0, shuffle=True)

print('Training data count:'+str(len(X_train)))
print('Test data count:'+str(len(X_test)))

That worked well.

Then I create the vectorizer with the following code:

#vectorizer
tfidf=TfidfVectorizer(sublinear_tf=True,min_df=3,stop_words=english,norm='l2',encoding='utf-8',ngram_range=(1,3))
print("rr")
train_features=tfidf.fit_transform(X_train)
test_features=tfidf.transform(X_test)
train_labels=Y_train
test_labels=Y_test

This raises the following error:

return frozenset(stop)
TypeError: 'LazyCorpusLoader' object is not iterable

I searched for solutions and tried several, but none of them worked. How can I overcome this error? I need to vectorize the data set to train a recommendation system.

Note: I searched the internet and read similar questions on Stack Overflow, but couldn't find a proper answer.


Solution

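  • Without a full traceback we can only guess.

    The failing line is frozenset(stop), so the problem lies in the stop_words argument. LazyCorpusLoader is an NLTK class, so my guess is that your variable english, which is not defined anywhere in the code you shared, refers to an NLTK corpus object rather than a collection of words.

    You probably meant to use stop_words="english" (a string), which selects scikit-learn's built-in English stop word list. If you want NLTK's list instead, pass the words themselves, for example stopwords.words("english"), not the corpus object.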
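
    A minimal sketch of the suggested fix, assuming scikit-learn is installed. The toy documents train_docs and test_docs are hypothetical stand-ins for X_train and X_test, and min_df is lowered to 1 only because the sample is so small; the other parameters are copied from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the cleaned reviews in X_train / X_test.
train_docs = [
    "great movie love story",
    "terrible movie waste time",
    "love great acting great story",
    "waste terrible acting",
]
test_docs = ["great story terrible acting"]

# stop_words="english" is a string, so scikit-learn uses its built-in list.
# min_df=1 (not 3 as in the question) because this toy corpus is tiny.
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=1, stop_words="english",
                        norm="l2", encoding="utf-8", ngram_range=(1, 3))
train_features = tfidf.fit_transform(train_docs)
test_features = tfidf.transform(test_docs)

# Alternative: pass an explicit list of words. Any plain list of strings
# works; "movie" is just an illustrative custom stop word.
custom = TfidfVectorizer(stop_words=["movie"])
custom.fit(train_docs)

print(train_features.shape, test_features.shape)
```

    Either form gives the vectorizer something it can turn into a frozenset of strings, which is what the failing line in the traceback expects.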