python pandas scikit-learn sklearn-pandas tfidfvectorizer

Error with TfidfVectorizer and SelectKBest

I'm trying to follow this tutorial for doing some sentiment analysis, and I'm pretty sure my code is exactly the same up to this point. However, the values in my bag-of-words (BOW) matrix are very different from the tutorial's.

https://www.tensorscience.com/nlp/sentiment-analysis-tutorial-in-python-classifying-reviews-on-movies-and-products

Here's my code up until this point.

import nltk
import pandas as pd
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2


def openFile(path):
    #param path: path/to/file.ext (str)
    #Returns contents of file (str)
    with open(path) as file:
        data = file.read()
    return data

imdb_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/imdb_labelled.txt')
amzn_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/amazon_cells_labelled.txt')
yelp_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/yelp_labelled.txt')


datasets = [imdb_data, amzn_data, yelp_data]

combined_dataset = []
# separate samples from each other
for dataset in datasets:
    combined_dataset.extend(dataset.split('\n'))

# separate each label from each sample
dataset = [sample.split('\t') for sample in combined_dataset]


df = pd.DataFrame(data=dataset, columns=['Reviews', 'Labels'])
df = df[df["Labels"].notnull()]
df = df.sample(frac=1)


labels = df['Labels']
vectorizer = TfidfVectorizer(min_df=15)
bow = vectorizer.fit_transform(df['Reviews'])
len(vectorizer.get_feature_names())

selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)
vectorizer = TfidfVectorizer(min_df=15, vocabulary=selected_features)
bow = vectorizer.fit_transform(df['Reviews'])

bow

Here's my result.

This is the result from the tutorial.

I've been trying to figure out what the issue could be, but I haven't made any progress yet.

Solution

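The problem is that get_support(indices=True) returns integer column indices, and you're passing those indices as the vocabulary. The vocabulary parameter of TfidfVectorizer expects the terms themselves (strings), so integer entries never match any token in your reviews. Supply the actual terms instead.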

Try this:

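import numpy as np

selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)
# Map the selected column indices back to their terms.
# (On scikit-learn 1.0+, use get_feature_names_out(); get_feature_names() was removed in 1.2.)
vocabulary = np.array(vectorizer.get_feature_names())[selected_features]

vectorizer = TfidfVectorizer(min_df=15, vocabulary=vocabulary)  # pass the terms, not their indices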

bow = vectorizer.fit_transform(df['Reviews'])
bow
<3000x200 sparse matrix of type '<class 'numpy.float64'>'
    with 12916 stored elements in Compressed Sparse Row format>
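
If you want to see the difference in isolation, here is a minimal sketch on a toy corpus. The reviews, labels and k=3 are made up for illustration and are not from the tutorial. It looks terms up through the fitted vocabulary_ mapping rather than get_feature_names(), on the assumption that this sidesteps the method rename between scikit-learn versions. It also shows what I'd expect to happen when the raw indices are passed: no token matches, so the matrix has no stored elements.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy data, hypothetical and for illustration only.
reviews = [
    "great phone great battery",
    "great movie loved it",
    "terrible phone bad battery",
    "bad movie terrible plot",
]
labels = [1, 1, 0, 0]

# Step 1: fit on the full vocabulary.
vectorizer = TfidfVectorizer()
bow = vectorizer.fit_transform(reviews)

# Step 2: select the k best columns; this yields integer column indices.
selected_indices = SelectKBest(chi2, k=3).fit(bow, labels).get_support(indices=True)

# Step 3: map each column index back to its term.
# vocabulary_ maps term -> column index, so invert it.
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}
selected_terms = [index_to_term[i] for i in selected_indices]

# Passing the indices themselves: integer keys never equal string tokens.
broken_vectorizer = TfidfVectorizer(vocabulary=selected_indices.tolist())
broken_bow = broken_vectorizer.fit_transform(reviews)

# Passing the terms: the matrix is restricted to the selected features.
reduced_vectorizer = TfidfVectorizer(vocabulary=selected_terms)
reduced_bow = reduced_vectorizer.fit_transform(reviews)

print(selected_terms)
print("stored elements with indices:", broken_bow.nnz)
print("stored elements with terms:", reduced_bow.nnz)
```

Both matrices have the same shape (one column per vocabulary entry), which is why the mistake is easy to miss. Only the number of stored elements gives it away.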