I am trying to make predictions with my LDA model, but when I pass a string to it, it raises an error about mismatched input features. How can I make my model accept any input and still predict the right topic? Right now it expects 54777 features as input.
model:
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(npr['Article'])
LDA = LatentDirichletAllocation(n_components=7,random_state=42)
LDA.fit(dtm)
prediction:
txt = ["The election of Donald Trump was a surprise to pollsters, pundits and, perhaps most of all, the Democratic Party."]
vectorizer = CountVectorizer()
txt_vectorized = vectorizer.fit_transform(txt)
predict = LDA.transform(txt_vectorized)
print(predict)
error:
ValueError: X has 16 features, but LatentDirichletAllocation is expecting 54777 features as input.
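There are three issues with this code snippet.
1. max_df and min_df should ideally be both int or both float, so that the two thresholds are on the same scale. (scikit-learn does accept a mix, so this is not what raises the error.)
2. At prediction time, you should use the same CountVectorizer that was fitted on the training data, not a new one.
3. You should use the .transform method, not the fit_transform method of CountVectorizer. Calling fit_transform on the new text learns a fresh vocabulary from that text alone, which is why the model receives 16 features instead of the 54777 it was trained on.
Here is an example code that will help you: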
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
cv = CountVectorizer()
Train the model:
from sklearn.decomposition import LatentDirichletAllocation
# Learn the vocabulary from the training corpus and build the document-term matrix
dtm = cv.fit_transform(corpus)
LDA = LatentDirichletAllocation(n_components=7, random_state=42)
LDA.fit(dtm)
Prediction:
txt = ["This is a new document"]
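# Reuse the fitted vectorizer: transform() keeps the training vocabulary,
# so the feature count matches what the LDA model expects
txt_vectorized = cv.transform(txt)
# Each row is the topic distribution for one input document
predict = LDA.transform(txt_vectorized)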
print(predict)
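If you want to avoid keeping the vectorizer and the model in sync by hand, one option is to bundle them. The following is only a minimal sketch, assuming a toy corpus and two topics; the names `topic_model` and `predict_topic` are made up for illustration and are not part of scikit-learn. It uses `make_pipeline` so that any new string goes through the already fitted vectorizer before reaching the LDA model, and words that were not in the training vocabulary are simply ignored.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus (assumption: replace with your own training texts)
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Bundle vectorizer and LDA so the training vocabulary is always reused.
# n_components=2 is an arbitrary choice for this small example.
topic_model = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=42),
)
topic_model.fit(corpus)


def predict_topic(text):
    """Return (index of the most probable topic, topic distribution) for one string."""
    # The pipeline calls transform() (not fit_transform()) on the vectorizer here
    dist = topic_model.transform([text])[0]
    return int(dist.argmax()), dist


topic, dist = predict_topic("This is a new document")
print(topic, dist)
```

The returned index is only the most probable topic under the fitted model; whether it is the "right" topic still depends on how well the model was trained.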