python, scikit-learn, nlp, topic-modeling, svd

Determine the correct number of topics using latent semantic analysis


Starting from the following example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]

# l1-normalised term frequencies (idf disabled)
vectorizer = TfidfVectorizer(use_idf=False, norm='l1')
bag_of_words = vectorizer.fit_transform(body)

# project the document-term matrix onto 2 latent topics
svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(bag_of_words)

I would like to understand if there is (perhaps in scikit-learn) a way to choose the most appropriate number of topics.

In this specific case I have chosen 2 topics arbitrarily, but I would like to know whether there is a way in Python to generalize to larger cases (more documents and more words) and choose the number of topics automatically.

Thank you for your help.


Solution

  • You can fit the model for a range of possible numbers of components and compare the explained variance. The upper bound on the number of components is the size of your vocabulary (the number of features).

    import matplotlib.pyplot as plt

    performance = []
    test = range(1, bag_of_words.shape[1], 2)  # candidate numbers of components, up to the vocabulary size
    
    for n in test:
        svd = TruncatedSVD(n_components=n)
        svd.fit(bag_of_words)
        performance.append(svd.explained_variance_ratio_.sum())  # total variance explained by n components
    
    fig = plt.figure(figsize=(15, 5))
    plt.plot(test, performance, 'ro--')
    plt.title('explained variance by n-components');
    

    The slope between consecutive points in the graph shows how much each added component contributes to the model's explained variance, and where adding more components no longer gains information.

    [Figure: explained variance by n-components]
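
    To make that marginal contribution explicit, here is a small sketch (reusing the test grid and the performance list from the loop above) that prints the gain in explained variance at each step:

    import numpy as np
    
    # gain in explained variance between consecutive points of the grid
    gains = np.diff(performance)
    
    for n, gain in zip(test[1:], gains):
        print(f'up to {n} components: +{gain:.3f} explained variance')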

    To get the number of components after which no more information is added:

    import numpy as np
    
    # first n in the grid at which the explained variance peaks
    test[np.array(performance).argmax()]
    

    Output

    5
    

    With the elbow method you can instead pick the point of maximum curvature, i.e. the number of components just before the largest drop in the information added per component:

    # n at the largest absolute second derivative of the curve (the 'elbow')
    test[np.abs(np.gradient(np.gradient(performance))).argmax()]
    

    Output

    3
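
    To generalize this to larger corpora, as the question asks, a common heuristic is to keep the smallest number of components whose cumulative explained variance crosses a chosen threshold. A minimal sketch, assuming a hypothetical 90% cutoff and reusing bag_of_words from the question:

    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    
    def choose_n_topics(matrix, threshold=0.90):  # threshold is a hypothetical cutoff, tune it for your corpus
        # a single fit with the largest sensible number of components
        # (bounded by the rank of the document-term matrix) gives the whole curve
        max_components = min(matrix.shape) - 1
        svd = TruncatedSVD(n_components=max_components)
        svd.fit(matrix)
        # smallest n whose cumulative explained variance reaches the threshold
        cumulative = np.cumsum(svd.explained_variance_ratio_)
        return min(int(np.searchsorted(cumulative, threshold)) + 1, max_components)
    
    n_topics = choose_n_topics(bag_of_words)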