Tags: python, nlp, logistic-regression

Logistic regression: X has 667 features per sample; expecting 74869


Using an IMDB movie reviews dataset, I have built a logistic regression model to predict the sentiment of each review.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None,
                        tokenizer=fill, use_idf=True, norm='l2', smooth_idf=True)
y = df.sentiment.values
X = tfidf.fit_transform(df.review)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.3, shuffle=False)
clf = LogisticRegressionCV(cv=5, scoring="accuracy", random_state=1, n_jobs=-1,
                           verbose=3, max_iter=300).fit(X_train, y_train)

yhat = clf.predict(X_test)


print("accuracy:")
print(clf.score(X_test, y_test))

model_performance(X_train, y_train, X_test, y_test, clf)

Prior to this, text preprocessing had already been applied. model_performance is just a function that builds a confusion matrix. This all works well, with good accuracy.
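model_performance itself is not shown here; a minimal sketch of such a helper, assuming it only computes and plots a confusion matrix on the test set, could look like this:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# hypothetical sketch of the confusion-matrix helper; the real model_performance may differ
def model_performance(X_train, y_train, X_test, y_test, clf):
    cm = confusion_matrix(y_test, clf.predict(X_test))
    ConfusionMatrixDisplay(cm).plot()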

I now scrape new IMDB reviews:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

# The movie "Joker" IMDB review page
url_link = 'https://www.imdb.com/title/tt7286456/reviews'
html = urlopen(url_link)

content_bs = BeautifulSoup(html, 'html.parser')

JokerReviews = []
# Every review sits in a div with class "text" (this can be seen in the IMDB page source)
for b in content_bs.find_all('div', class_='text'):
    JokerReviews.append(b)

df = pd.DataFrame.from_records(JokerReviews)
df['sentiment'] = "0"
jokerData = df[0]
jokerData = jokerData.apply(preprocessor)

Problem: now I wish to use the same logistic regression to predict the sentiment of these new reviews:

tfidf2 = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None,
                         tokenizer=fill, use_idf=True, norm='l2', smooth_idf=True)
y = df.sentiment.values
Xjoker = tfidf2.fit_transform(jokerData)

yhat = clf.predict(Xjoker)

But I get the error: ValueError: X has 667 features per sample; expecting 74869

I don't get why it has to have the same number of features as X_test.


Solution

  • The problem is that your model was trained after a preprocessing step that identified 74869 unique words, while the preprocessing of your inference data identified only 667 words, and the model expects input with exactly the same set of columns. On top of that, some of the 667 words found at inference time may not be in the model's vocabulary at all. (An alternative that avoids the mismatch entirely is sketched at the end of this answer.)

    To create a valid input for your model, you can use an approach such as the one below (it assumes X and Xjoker are pandas DataFrames whose columns are the feature names produced by their respective vectorizers):

    # check which columns are expected by the model but do not exist in the inference dataframe
    not_existing_cols = [c for c in X.columns.tolist() if c not in Xjoker]
    # add these columns to the inference dataframe
    Xjoker = Xjoker.reindex(Xjoker.columns.tolist() + not_existing_cols, axis=1)
    # the new columns have no values, so replace NaN with 0
    Xjoker.fillna(0, inplace=True)
    # use the original X structure as a mask for the new inference dataframe
    Xjoker = Xjoker[X.columns.tolist()]
    

    After these steps, you can call the predict() method.
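
    Alternatively, since the mismatch comes from fitting a brand-new vectorizer (tfidf2) on the Joker reviews, you can avoid it entirely by reusing the vectorizer that was fitted on the training data. A minimal sketch, assuming the fitted tfidf object and clf are still available:

    # transform() maps the new reviews into the same 74869-column vocabulary the
    # model was trained on (words unseen during training are simply ignored)
    Xjoker = tfidf.transform(jokerData)
    yhat = clf.predict(Xjoker)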