python pandas scikit-learn sklearn-pandas tfidfvectorizer

Error with TfidfVectorizer and SelectKBest


I'm following this tutorial on sentiment analysis, and I'm fairly sure my code matches it exactly up to this point. However, the values in my bag-of-words (BOW) matrix differ significantly from the tutorial's.

https://www.tensorscience.com/nlp/sentiment-analysis-tutorial-in-python-classifying-reviews-on-movies-and-products

Here's my code up until this point.

import nltk
import pandas as pd
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2


def openFile(path):
    #param path: path/to/file.ext (str)
    #Returns contents of file (str)
    with open(path) as file:
        data = file.read()
    return data

imdb_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/imdb_labelled.txt')
amzn_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/amazon_cells_labelled.txt')
yelp_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/yelp_labelled.txt')


datasets = [imdb_data, amzn_data, yelp_data]

combined_dataset = []
# separate samples from each other
for dataset in datasets:
    combined_dataset.extend(dataset.split('\n'))

# separate each label from each sample
dataset = [sample.split('\t') for sample in combined_dataset]


df = pd.DataFrame(data=dataset, columns=['Reviews', 'Labels'])
df = df[df["Labels"].notnull()]
df = df.sample(frac=1)


labels = df['Labels']
vectorizer = TfidfVectorizer(min_df=15)
bow = vectorizer.fit_transform(df['Reviews'])
len(vectorizer.get_feature_names())

selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)
vectorizer = TfidfVectorizer(min_df=15, vocabulary=selected_features)
bow = vectorizer.fit_transform(df['Reviews'])

bow

Here's my result.

[Image: my result]

This is the result from the tutorial.

[Image: problematic part in tutorial]

I've been trying to figure out what the issue could be, but I haven't made any progress yet.


Solution

  • The problem is that you're supplying integer indices as the vocabulary. Supply the actual vocabulary terms instead.

    Try this:

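    import numpy as np

    # get_support(indices=True) returns column positions, not terms
    selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)
    # map the positions back to terms (on scikit-learn >= 1.0, use get_feature_names_out() instead)
    vocabulary = np.array(vectorizer.get_feature_names())[selected_features]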
    
    vectorizer = TfidfVectorizer(min_df=15, vocabulary=vocabulary) # you need to supply a real vocab here
    
    bow = vectorizer.fit_transform(df['Reviews'])
    bow
    <3000x200 sparse matrix of type '<class 'numpy.float64'>'
        with 12916 stored elements in Compressed Sparse Row format>