python lda

Problem with input features for Latent Dirichlet Allocation


I am trying to make predictions with my LDA model, but when I pass a string to it, I get an error about mismatched input features. How can I make my model accept any input text and still predict the right topic? Right now it expects 54777 features as input.

model:

cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(npr['Article'])
LDA = LatentDirichletAllocation(n_components=7,random_state=42)
LDA.fit(dtm)

prediction:

txt = ["The election of Donald Trump was a surprise to pollsters, pundits and, perhaps most of all, the Democratic Party."]
vectorizer = CountVectorizer()
txt_vectorized = vectorizer.fit_transform(txt)
predict = LDA.transform(txt_vectorized)
print(predict)

error:

ValueError: X has 16 features, but LatentDirichletAllocation is expecting 54777 features as input.

Solution

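  • There are two issues with this code snippet, and one thing that is not a problem.

    • Issue-1: At prediction time you have to use the same CountVectorizer instance that was fitted on the training data. A new vectorizer learns a new vocabulary (16 features here) instead of the 54777 features the model was trained on.
    • Issue-2: At prediction time you have to call the transform method, not the fit_transform method, of that CountVectorizer.
    • Note: max_df=0.95 together with min_df=2 is valid. Each parameter is independently read as a proportion of documents when it is a float and as an absolute document count when it is an int.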

    Here is example code that shows the fix:

    from sklearn.feature_extraction.text import CountVectorizer
    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
    ]
    cv = CountVectorizer()
    

    Train the model:

    from sklearn.decomposition import LatentDirichletAllocation
    
    # Learn the vocabulary from the training corpus and build the document-term matrix
    dtm = cv.fit_transform(corpus)
    LDA = LatentDirichletAllocation(n_components=7, random_state=42)
    LDA.fit(dtm)
    

    Prediction:

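    txt = ["This is a new document"]
    # Reuse the fitted vectorizer; transform() keeps the training vocabulary,
    # so the feature count matches what the LDA model expects
    txt_vectorized = cv.transform(txt)
    # Each row holds the topic distribution for one input document
    predict = LDA.transform(txt_vectorized)
    print(predict)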
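
One way to avoid vectorizing new text with the wrong object is to keep the vectorizer and the LDA model together in a scikit-learn Pipeline. The sketch below is only an illustration under a few assumptions: it uses the toy corpus from above rather than the original npr data, n_components=2 is an arbitrary choice for such a small corpus, and the names topic_model, new_docs, doc_topics and dominant_topic are made up for this example.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the real training articles (assumption)
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# One object holds both the fitted vocabulary and the fitted LDA model
topic_model = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=42),
)
topic_model.fit(corpus)

# Raw strings go in; the pipeline calls transform() on the already fitted
# vectorizer, so the feature count always matches the trained model
new_docs = [
    "This is a new document",
    "The election was a surprise to pollsters",
]
doc_topics = topic_model.transform(new_docs)  # shape: (n_docs, n_components)

# Index of the most probable topic for each document
dominant_topic = doc_topics.argmax(axis=1)
print(doc_topics)
print(dominant_topic)
```

Words in the new text that are not in the training vocabulary are ignored by transform, which is why any input string is accepted. Saving and reloading the whole pipeline (for example with joblib) keeps the vocabulary and the model in sync.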