Starting from the following example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]

# term-frequency vectors, L1-normalised, no idf weighting
vectorizer = TfidfVectorizer(use_idf=False, norm='l1')
bag_of_words = vectorizer.fit_transform(body)

# reduce the documents to 2 latent topics (LSA)
svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(bag_of_words)
I would like to understand whether there is a way (perhaps in scikit-learn) to choose the most appropriate number of topics.
In my specific case I chose 2 topics arbitrarily, but I would like to generalize to larger cases (with more documents and more words) and choose the number of topics automatically in Python.
Thank you for your help.
You can compute the explained variance for a range of possible numbers of components. The upper bound on the number of components is the size of your vocabulary (more precisely, the rank of the document-term matrix, which is at most the smaller of the number of documents and the vocabulary size).
import matplotlib.pyplot as plt

performance = []
test = range(1, bag_of_words.shape[1], 2)
for n in test:
    svd = TruncatedSVD(n_components=n)
    lsa = svd.fit(bag_of_words)
    # cumulative explained variance ratio for this number of components
    performance.append(lsa.explained_variance_ratio_.sum())

fig = plt.figure(figsize=(15, 5))
plt.plot(test, performance, 'ro--')
plt.title('explained variance by n-components');
The slope between points in the graph shows how much each added component contributes to the explained variance of your model and where no more information is gained.
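If you prefer numbers to reading the slope off the plot, here is a minimal sketch (assuming the test and performance variables from above, and using numpy) that prints the gain in explained variance between consecutive tested values:

import numpy as np

# gain in cumulative explained variance between consecutive tested values of n_components
gains = np.diff(performance)
for n, gain in zip(list(test)[1:], gains):
    print(f'n_components={n}: +{gain:.3f} explained variance')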
To get the number of components after which no more information is added (the first tested value at which the cumulative explained variance ratio reaches its maximum):
import numpy as np
test[np.array(performance).argmax()]
Output
5
With the elbow method you can find the number of components just before the largest drop in added information (the point of maximum curvature, approximated here by the largest absolute second gradient):
test[np.abs(np.gradient(np.gradient(performance))).argmax()]
Output
3
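As a follow-up, here is a minimal sketch of how you might plug the selected value back in and refit; best_n is just a name introduced here for illustration:

import numpy as np

# number of components selected by the elbow criterion above
best_n = test[np.abs(np.gradient(np.gradient(performance))).argmax()]

# refit LSA with the selected number of topics and project the documents
svd = TruncatedSVD(n_components=best_n)
lsa = svd.fit_transform(bag_of_words)
print(lsa.shape)  # (n_documents, best_n)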