Starting from the following example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]

# term-frequency vectors, L1-normalised, no idf weighting
vectorizer = TfidfVectorizer(use_idf=False, norm='l1')
bag_of_words = vectorizer.fit_transform(body)

# reduce the documents to 2 latent topics (LSA)
svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(bag_of_words)
I would like to understand whether there is a way (perhaps in scikit-learn) to choose the most appropriate number of topics.
In my specific case I chose 2 topics arbitrarily, but I would like to generalize to larger cases (with more documents and more words) and choose the number of topics automatically in Python.
Thank you for your help.
You can compute the explained variance for a range of possible numbers of components. The upper bound on the number of components is the size of your vocabulary (more precisely, the rank of the document-term matrix, which is at most the smaller of the number of documents and the vocabulary size).
import matplotlib.pyplot as plt

performance = []
test = range(1, bag_of_words.shape[1], 2)
for n in test:
    svd = TruncatedSVD(n_components=n)
    lsa = svd.fit(bag_of_words)
    # cumulative explained variance ratio for this number of components
    performance.append(lsa.explained_variance_ratio_.sum())

fig = plt.figure(figsize=(15, 5))
plt.plot(test, performance, 'ro--')
plt.title('explained variance by n-components');
The slope between points in the graph shows how much each added component contributes to the explained variance of your model and where no more information is gained.
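If you prefer numbers to reading the slope off the plot, here is a minimal sketch (assuming the test and performance variables from above, and using numpy) that prints the gain in explained variance between consecutive tested values:

import numpy as np

# gain in cumulative explained variance between consecutive tested values of n_components
gains = np.diff(performance)
for n, gain in zip(list(test)[1:], gains):
    print(f'n_components={n}: +{gain:.3f} explained variance')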
To get the number of components after which no more information is added (the first tested value at which the cumulative explained variance ratio reaches its maximum):
import numpy as np
test[np.array(performance).argmax()]
Output
5
With the elbow method you can find the number of components just before the largest drop in added information (the point of maximum curvature, approximated here by the largest absolute second gradient):
test[np.abs(np.gradient(np.gradient(performance))).argmax()]
Output
3
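As a follow-up, here is a minimal sketch of how you might plug the selected value back in and refit; best_n is just a name introduced here for illustration:

import numpy as np

# number of components selected by the elbow criterion above
best_n = test[np.abs(np.gradient(np.gradient(performance))).argmax()]

# refit LSA with the selected number of topics and project the documents
svd = TruncatedSVD(n_components=best_n)
lsa = svd.fit_transform(bag_of_words)
print(lsa.shape)  # (n_documents, best_n)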