Tags: python-3.x, scikit-learn, nlp, latent-semantic-analysis

How does sklearn's Latent Dirichlet Allocation really work?


I have some texts, and I'm using sklearn's LatentDirichletAllocation algorithm to extract topics from them.

I already have the texts converted into sequences using Keras and I'm doing this:

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation()
X_topics = lda.fit_transform(X)

X:

print(X)
#  array([[0, 988, 233, 21, 42, 5436, ...],
#         [0, 43, 6526, 21, 566, 762, 12, ...]])

X_topics:

print(X_topics)
#  array([[1.24143852e-05, 1.23983890e-05, 1.24238815e-05, 2.08399432e-01,
#          7.91563331e-01],
#         [5.64976371e-01, 1.33304549e-05, 5.60003133e-03, 1.06638803e-01,
#          3.22771464e-01]])

My question is: what exactly is being returned by fit_transform? I know it should be the main topics detected in the texts, but I cannot map those numbers to an index, so I'm not able to see what those sequences mean. I couldn't find an explanation of what is actually happening, so any suggestion would be much appreciated.


Solution

  • First, a general explanation: think of LDiA as a clustering algorithm that, by default, determines 10 centroids based on the frequencies of words in the texts, and that weights some of those words more heavily than others by virtue of their proximity to a centroid. Each centroid represents a 'topic' in this context; the topic is unnamed, but can be loosely described by the words that are most dominant in forming each cluster.

    So generally what you're doing with LDiA is either:

    • getting it to tell you what the 10 (or however many) topics of a given collection of texts are,
      or
    • getting it to tell you which centroid/topic some new text is closest to.

    For the second scenario, your expectation is that LDiA will output the "score" of the new text for each of the 10 clusters/topics. The index of the highest score is the index of the cluster/topic to which that new text belongs.
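
    To make that concrete with the X_topics array from the question (a minimal sketch; the numbers are just the printed values, truncated):

    import numpy as np

    # each row of X_topics is one document's distribution over the topics
    # (the values in a row sum to ~1); the argmax of a row is the index
    # of that document's dominant topic
    X_topics = np.array([[1.24e-05, 1.24e-05, 1.24e-05, 2.08e-01, 7.92e-01],
                         [5.65e-01, 1.33e-05, 5.60e-03, 1.07e-01, 3.23e-01]])

    print(X_topics.argmax(axis=1))  # [4 0] -> doc 0 belongs to topic 4, doc 1 to topic 0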

    I prefer gensim.models.LdaMulticore, but since you've used sklearn.decomposition.LatentDirichletAllocation, I'll stick with that.
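
    (For the curious, a minimal gensim sketch of the same idea - the two toy documents below are made up purely for illustration:)

    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    # hypothetical pre-tokenized documents, just for illustration
    docs = [["cat", "dog", "vet", "pet"],
            ["stock", "market", "trade", "price"]]

    dictionary = Dictionary(docs)                   # word <-> integer id mapping
    corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words counts per doc

    lda_gensim = LdaMulticore(corpus, id2word=dictionary, num_topics=2, passes=10)
    print(lda_gensim.print_topics())  # top words per topic
    print(lda_gensim[corpus[0]])      # (topic_id, weight) pairs for the first doc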

    Here's some sample code (drawn from the scikit-learn documentation's topic extraction example) that runs through this process:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.datasets import fetch_20newsgroups
    
    n_samples = 2000
    n_features = 1000
    n_components = 10
    n_top_words = 20
    
    def print_top_words(model, feature_names, n_top_words):
        # model.components_ has one row per topic; argsort()[:-n_top_words - 1:-1]
        # picks the indices of the n_top_words highest-weighted words in that row
        for topic_idx, topic in enumerate(model.components_):
            message = "Topic #%d: " % topic_idx
            message += " ".join([feature_names[i]
                                 for i in topic.argsort()[:-n_top_words - 1:-1]])
            print(message)
        print()
        
    data, _ = fetch_20newsgroups(shuffle=True, random_state=1,
                                 remove=('headers', 'footers', 'quotes'),
                                 return_X_y=True)
    X = data[:n_samples]
    # build a document-term count matrix with sklearn's CountVectorizer -
    # note that LDA expects raw token counts, not Keras-style integer id sequences
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                    max_features=n_features,
                                    stop_words='english')
    vectorizedX = tf_vectorizer.fit_transform(X)
    lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                    learning_method='online',
                                    learning_offset=50.,
                                    random_state=0)
    lda.fit(vectorizedX)
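
    Before scoring new text, one detail worth knowing: after fitting, lda.components_ is an (n_components, n_features) array of pseudo-counts - one row per topic, one column per CountVectorizer feature - and it's exactly what print_top_words above sorts through. A quick check:

    # one row per topic, one column per vocabulary word
    print(lda.components_.shape)  # (10, 1000)

    # normalizing a row turns the pseudo-counts into that topic's word probabilities
    topic0 = lda.components_[0] / lda.components_[0].sum()
    print(topic0.sum())           # 1.0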
    
    

    Now let's try a new text:

    testX = tf_vectorizer.transform(["I am educated about learned stuff"])
    #get lda to score this text against each of the 10 topics
    lda.transform(testX)
    
    Out:
    array([[0.54995409, 0.05001176, 0.05000163, 0.05000579, 0.05      ,
            0.05001033, 0.05000001, 0.05001449, 0.05000123, 0.05000066]])
    
    # looks like the first topic (index 0) has the highest score - now,
    # what are the words most associated with each topic?
    print("\nTopics in LDA model:")
    tf_feature_names = tf_vectorizer.get_feature_names_out()  # get_feature_names() was removed in newer sklearn
    print_top_words(lda, tf_feature_names, n_top_words)
    
    Out:
    Topics in LDA model:
    Topic #0: edu com mail send graphics ftp pub available contact university list faq ca information cs 1993 program sun uk mit
    Topic #1: don like just know think ve way use right good going make sure ll point got need really time doesn
    Topic #2: christian think atheism faith pittsburgh new bible radio games alt lot just religion like book read play time subject believe
    Topic #3: drive disk windows thanks use card drives hard version pc software file using scsi help does new dos controller 16
    Topic #4: hiv health aids disease april medical care research 1993 light information study national service test led 10 page new drug
    Topic #5: god people does just good don jesus say israel way life know true fact time law want believe make think
    Topic #6: 55 10 11 18 15 team game 19 period play 23 12 13 flyers 20 25 22 17 24 16
    Topic #7: car year just cars new engine like bike good oil insurance better tires 000 thing speed model brake driving performance
    Topic #8: people said did just didn know time like went think children came come don took years say dead told started
    Topic #9: key space law government public use encryption earth section security moon probe enforcement keys states lunar military crime surface technology
    
    
    

    Seems sensible - the sample text is about education, and the word list for the first topic (edu, university, ...) is education-related.

    The pictures below are from another dataset (ham vs. spam SMS messages, so only two possible topics) which I reduced to 3 dimensions with PCA, but in case a picture helps, these two plots (the same data from different angles) might give a general sense of what's going on with LDiA. (The graphs are from a Linear Discriminant Analysis vs. LDiA comparison, but the representation is still relevant.)

    [Figure: the ham/spam SMS data reduced to 3 dimensions with PCA, shown from two different angles]

    While LDiA is an unsupervised method, to actually use it in a business context you'll likely want to intervene manually and give the topics names that are meaningful in your context - e.g., assigning a subject area to stories on a news aggregation site by choosing among ['Business', 'Sports', 'Entertainment', etc.].
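
    A minimal sketch of that last step (the labels here are hypothetical - just my own reading of the word lists printed above):

    # hypothetical human-chosen names for a few of the topics above
    topic_names = {0: "university email/ftp", 3: "pc hardware", 6: "sports scores"}

    scores = lda.transform(testX)[0]
    best = int(scores.argmax())
    print(topic_names.get(best, "Topic #%d" % best))  # "university email/ftp"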

    For further study, perhaps run through something like this: https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24