python scikit-learn tf-idf text-processing

Movie Ratings prediction using TF-IDF

I have a dataset having the format-

Movie_Name, TomatoCritics, Target_Variable

Here, the TomatoCritics attribute holds free-text reviews from different users for different movies, and Target_Variable is a binary value (0 or 1) indicating whether the movie should be watched.

I am using TF-IDF to process this and my code is as follows-

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer


# Read textual training data-
text_training = pd.read_csv("Textual-Training_Data.csv")

# Read textual testing data-
text_testing = pd.read_csv("Textual-Testing_Data.csv")

# Get dimensions of training data-
text_training.shape
# (95, 3)

# Get dimensions of testing data-
text_testing.shape
# (224, 3)


# Check for missing values in training data-
text_training.isnull().values.any()
# True

# Check for missing values in testing data-
text_testing.isnull().values.any()
# True

# Remove any row having missing value from training data-
text_training_nona = text_training.dropna(axis = 0, how='any')

# Remove any row having missing value from testing data-
text_testing_nona = text_testing.dropna(axis = 0, how = 'any')

# Get dimensions of training data AFTER removing empty rows-
text_training_nona.shape
# (73, 3)

# Get dimensions of testing data AFTER removing empty rows-
text_testing_nona.shape
# (158, 3)


# Attributes to use for training and testing sets for ML-
cols_train = ['tomatoConsensus', 'goodforairplanes']
cols_test = ['tomatoConsensus', 'goodforairplanes']



# Split training dataset into features (X) and label (y) for training-
X_train = text_training_nona['tomatoConsensus']
y_train = text_training_nona['goodforairplanes']


# Split testing dataset into features (X) and label (y) for testing-
X_test = text_testing_nona["tomatoConsensus"]
y_test = text_testing_nona['goodforairplanes']




# Initialize the TF-IDF vectorizer ->
cv = TfidfVectorizer(min_df = 1, stop_words='english')

# Convert text to TF-IDF ->
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.fit_transform(X_test)

# Multinomial Naive Bayes classifier-
mnb = MultinomialNB()

# Train model on training data-
mnb.fit(X_train_cv, y_train)

print(X_test_cv[0])
'''
(0, 1168)   0.20066499253877468
  (0, 31)   0.2419027475877309
  (0, 1090) 0.22790133982975397
  (0, 5)    0.2616366234663056
  (0, 877)  0.2616366234663056
  (0, 1279) 0.2419027475877309
  (0, 850)  0.1786670002268731
  (0, 1341) 0.2616366234663056
  (0, 2)    0.2616366234663056
  (0, 695)  0.2616366234663056
  (0, 1221) 0.2419027475877309
  (0, 884)  0.1786670002268731
  (0, 1070) 0.2616366234663056
  (0, 782)  0.2616366234663056
  (0, 252)  0.20066499253877468
  (0, 1259) 0.2419027475877309
  (0, 1093) 0.20816746395117927
  (0, 122)  0.2170410042381541
'''

y_pred = mnb.predict(X_test_cv[0])

The last line using mnb.predict() gives the error-

ValueError: dimension mismatch

What's going wrong?

Thanks!

Solution

You should call fit_transform only once, on the training data, and then call transform on the test data using the same, already fitted cv object. Change

X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.fit_transform(X_test)

to

X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

and this should fix your problem.

If you call fit_transform again on different data, that data most likely contains a different set of unique words, so the vectorizer builds a new vocabulary of a different size. The test matrix then has a different number of columns than the matrix mnb was trained on, and that is what ValueError: dimension mismatch is telling you.
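To make the vocabulary point concrete, here is a minimal sketch on made-up review snippets (the strings and the names cv_train / cv_test are placeholders, not your data). It fits two separate vectorizers and compares the vocabularies they learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy stand-ins for the training and testing reviews
train_docs = ["a gripping thriller with a clever plot",
              "dull plot and flat acting"]
test_docs = ["clever acting saves a dull script",
             "a gripping and tense drama",
             "flat, forgettable sequel"]

# Fitting on each set separately learns two unrelated vocabularies
cv_train = TfidfVectorizer(stop_words='english').fit(train_docs)
cv_test = TfidfVectorizer(stop_words='english').fit(test_docs)

train_vocab_size = len(cv_train.vocabulary_)
test_vocab_size = len(cv_test.vocabulary_)
print(train_vocab_size, test_vocab_size)

# The same word is also mapped to a different column in each vocabulary
print(cv_train.vocabulary_['dull'], cv_test.vocabulary_['dull'])
```

Note the second print: even if the two vocabularies happened to have the same size, the columns would stand for different words, so the model would run without an error but its predictions would be meaningless.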

Edit
Just check the shapes of X_train_cv and X_test_cv in both cases: if you call fit_transform on both X_train and X_test, the number of columns differs, but if you replace the second fit_transform with transform, the number of columns is the same.
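A quick way to run that check, again as a sketch on made-up data (the documents and labels are placeholders; X_test_refit is a name I introduce here for the wrong variant, built with a second vectorizer so that cv itself stays fitted on the training data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data in place of the CSV columns
train_docs = ["a gripping thriller with a clever plot",
              "dull plot and flat acting"]
train_labels = [1, 0]
test_docs = ["clever acting saves a dull script",
             "a gripping and tense drama",
             "flat, forgettable sequel"]

cv = TfidfVectorizer(stop_words='english')
X_train_cv = cv.fit_transform(train_docs)

# Wrong: fitting again on the test set builds a new vocabulary
X_test_refit = TfidfVectorizer(stop_words='english').fit_transform(test_docs)

# Right: reuse the vocabulary learned from the training set
X_test_cv = cv.transform(test_docs)

# Compare the second entry (number of columns) of each shape
print(X_train_cv.shape, X_test_refit.shape, X_test_cv.shape)

mnb = MultinomialNB()
mnb.fit(X_train_cv, train_labels)

# Works because X_test_cv has the same columns as X_train_cv
y_pred = mnb.predict(X_test_cv)
print(y_pred)
```

Words in the test set that never appeared in the training set are simply ignored by transform, which is why the column count stays fixed.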