
Assigning a topic to each document in a corpus (LDA)


I am trying to compute the probability that each document belongs to each topic found by the LDA model. I have succeeded in producing the LDA, but now I am stuck. My code is as follows:

## Libraries to download
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim

## Tokenizing
tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = stopwords.words('english')

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

import json
import nltk
import re
import pandas

appended_data = []
for i in range(2005,2016):
    if i > 2013:
        df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
        appended_data.append(df0)
    df1 = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)])
    df2 = pandas.DataFrame([json.loads(l) for l in open('APJ_%d.json' % i)])
    df3 = pandas.DataFrame([json.loads(l) for l in open('TH500_%d.json' % i)])
    df4 = pandas.DataFrame([json.loads(l) for l in open('DRSM_%d.json' % i)])
    appended_data.append(df1)
    appended_data.append(df2)
    appended_data.append(df3)
    appended_data.append(df4)

appended_data = pandas.concat(appended_data)
doc_set = appended_data.body

# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:

    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]

    # add tokens to list
    texts.append(stopped_tokens)

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=15, id2word = dictionary, passes=50)

I am trying to follow the method described here, but I find it confusing. For example, when I try the following code:

# Assigning the topics to the documents in corpus
from itertools import chain

lda_corpus = ldamodel[corpus]

# Find the threshold; let's set the threshold to be 1/#clusters.
# To check that the threshold is sane, we average all the probabilities:
scores = list(chain(*[[score for topic_id, score in topic]
                      for topic in [doc for doc in lda_corpus]]))

threshold = sum(scores)/len(scores)
print(threshold)

cluster1 = [j for i,j in zip(lda_corpus,doc_set) if i[0][1] > threshold]

print(cluster1)

It seems to work, since it retrieves the articles that belong to topic 1. Nevertheless, can someone explain the intuition behind it, and whether there are other alternatives? For example, what is the intuition behind the threshold level here? Thanks
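For reference, each element of lda_corpus is a list of (topic_id, probability) pairs for one document; to inspect a single document's distribution directly from the model (reusing the ldamodel and corpus built above), something like this should work:

# Print the full topic distribution of the first document
doc_topics = ldamodel.get_document_topics(corpus[0], minimum_probability=0)
print(doc_topics)  # e.g. [(0, 0.02), (1, 0.41), ..., (14, 0.01)]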


Solution

  • As I hope you've read elsewhere, the threshold is an application-specific setting, depending on how broad you want your classification model to be. The 1/k rationale (for k clusters) is empirical: it works as a starting point (i.e. it yields recognizably useful results) for most classification tasks.

    The gut-level rationale is simple enough: if a document is matched to a topic strongly enough to outshine the chance of a random cluster placement, it's likely a positive identification. Of course, you have to tune "likely" once you get your first results.

    Most notably, you want to watch for one or two "noisy" clusters, those in which the topics are only loosely related: their standard deviation is noticeably larger than the others'. Some applications compute Z-scores for the topics and set a Z-based threshold for each topic; others use a single generic threshold for all the topics in a given cluster (a rough sketch of the 1/k baseline and a per-topic, Z-based threshold follows after this answer).

    Your final solution depends on your required strength of match (which sets how low the threshold can go), topic variation (topic-specific thresholds), required accuracy (what are the costs of a false positive versus a false negative?) and desired training & scoring speeds.

    Is this enough help to move you along?
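
    To make the thresholding ideas above concrete, here is a rough sketch (not a drop-in solution) that reuses the ldamodel, corpus and doc_set from the question, takes 1/k as the global baseline, and uses mean + z·std of each topic's scores as an illustrative per-topic, Z-based cutoff:

import numpy as np

k = ldamodel.num_topics
baseline = 1.0 / k  # the 1/k starting threshold

# Build a dense document-topic matrix: one row per document, one column per topic
doc_topic = np.zeros((len(corpus), k))
for d, bow in enumerate(corpus):
    for topic_id, prob in ldamodel.get_document_topics(bow, minimum_probability=0):
        doc_topic[d, topic_id] = prob

# 1) Global 1/k threshold: a document joins every topic it beats "chance" on
clusters_baseline = {
    t: [doc_set.iloc[d] for d in range(len(corpus)) if doc_topic[d, t] > baseline]
    for t in range(k)
}

# 2) Per-topic threshold: mean + z * std of that topic's scores across documents
z = 1.0  # illustrative value; tune per application
topic_thresholds = doc_topic.mean(axis=0) + z * doc_topic.std(axis=0)
clusters_per_topic = {
    t: [doc_set.iloc[d] for d in range(len(corpus)) if doc_topic[d, t] > topic_thresholds[t]]
    for t in range(k)
}

print(len(clusters_baseline[0]), len(clusters_per_topic[0]))  # topic 0 under each rule

    Lowering z loosens the per-topic cutoff; raising it keeps only the documents that stand out strongly for that topic.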