Tags: python, lda, topic-modeling, document-classification, pyldavis

How to get topic of new document in LDA model


How can I dynamically pass a .txt document supplied by the user to an LDA model? I have tried the code below, but it does not give the proper topic of the document. My .txt file is about sports, so it should give the topic name as Sports. Instead, it gives this output:

Score: 0.5569453835487366   - Topic: 0.008*"bike" + 0.005*"game" + 0.005*"team" + 0.004*"run" + 0.004*"virginia"
Score: 0.370819091796875    - Topic: 0.016*"game" + 0.014*"team" + 0.011*"play" + 0.008*"hockey" + 0.008*"player"
Score: 0.061239391565322876 - Topic: 0.010*"card" + 0.010*"window" + 0.008*"driver" + 0.007*"sale" + 0.006*"price"
import re
import gensim
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
import spacy

stop_words = stopwords.words('english')
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])  # assumes the small English spaCy model is installed

data = df.content.values.tolist()
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]  # remove emails
data = [re.sub(r'\s+', ' ', sent) for sent in data]        # collapse whitespace
data = [re.sub(r"\'", "", sent) for sent in data]          # remove single quotes

def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True strips accent marks; punctuation is removed by simple_preprocess

data_words = list(sent_to_words(data))
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)  # higher threshold means fewer phrases
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    # Keep only the lemmas of tokens whose part of speech is in allowed_postags
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)
# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)
# Lemmatize, keeping only nouns, adjectives, verbs and adverbs
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

# Create the dictionary and the bag-of-words corpus
id2word = gensim.corpora.Dictionary(data_lemmatized)

texts = data_lemmatized

corpus = [id2word.doc2bow(text) for text in texts]
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

# Document given by the user, which is related to sports
with open("text.txt", mode="r", encoding="utf-8") as p:
    content = p.read()

# Apply the same preprocessing to the new document as to the training data
new_words = gensim.utils.simple_preprocess(content, deacc=True)
new_words = [word for word in new_words if word not in stop_words]
new_lemmas = lemmatization([bigram_mod[new_words]])[0]
bow_vector = id2word.doc2bow(new_lemmas)

# per_word_topics=True changes what lda_model[bow] returns, so query the
# document-topic distribution explicitly
for index, score in sorted(lda_model.get_document_topics(bow_vector), key=lambda tup: -1 * tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))



Solution

  • All your code is correct, but I think your expectation of LDA modeling might be a little off. The output you received is the correct one!

    Firstly, you used the phrase "topic name": the topics LDA generates don't have names, and there is no simple mapping between them and the labels of the data used to train the model. LDA is an unsupervised model, and you'd often train it on data that has NO labels. If your corpus contains documents belonging to classes A, B, C, and D, and you train an LDA model to output four topics L, M, N, and O, it does NOT follow that there exists some mapping like the one below (one way to check this empirically is sketched after the example):

    A -> M
    B -> L
    C -> O
    D -> N
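
    If you do have labels and want to see how loosely the learned topics line up with them, one way is to cross-tabulate each document's dominant topic against its known label. A minimal sketch, assuming a hypothetical list labels holding the known class of each document in corpus:

    from collections import Counter

    # Hypothetical: `labels` holds the known class label of each document in `corpus`
    crosstab = Counter()
    for bow, label in zip(corpus, labels):
        doc_topics = lda_model.get_document_topics(bow)
        dominant = max(doc_topics, key=lambda tup: tup[1])[0]  # topic id with highest probability
        crosstab[(label, dominant)] += 1

    for (label, topic), count in sorted(crosstab.items()):
        print("label={} topic={} docs={}".format(label, topic, count))

    If the counts for one label spread across several topics (and vice versa), there is no clean class-to-topic mapping.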
    

    Secondly, be careful with the difference between tokens and topics in the output. The output of LDA looks something like:

    Topic 1: 0.5 - 0.005*"token_13" + 0.003*"token_204" + ...

    Topic 2: 0.07 - 0.01*"token_24" + 0.001*"token_3" + ...

    In other words, every document is given a probability of belonging to each of the topics, and every topic is a weighted sum over the corpus tokens; those weights are what uniquely define the topic.
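
    To make this concrete, here is a minimal sketch of how to inspect both views in gensim, reusing the lda_model and bow_vector built above:

    # Per-document view: probability that the document belongs to each topic
    for topic_id, prob in lda_model.get_document_topics(bow_vector, minimum_probability=0.0):
        print("P(topic {} | doc) = {:.4f}".format(topic_id, prob))

    # Per-topic view: the most heavily weighted tokens that define one topic
    for token, weight in lda_model.show_topic(0, topn=5):
        print("{:.4f} * {}".format(weight, token))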

    There is a temptation to look at the most heavily weighted tokens in each topic and interpret each topic as a class. For example:

    # If you have:
    topic_1 = 0.1*"dog" + 0.08*"cat" + 0.04*"snake"
    
    # It's tempting to name topic_1 = pets
    

    But this is very challenging to validate and depends heavily on human intuition. A more common usage of LDA is when you have no labels and you want to identify which documents are semantically similar to each other, without necessarily determining the correct class label for each document; one way to do this is sketched below.
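
    For that use case you can compare documents by the distance between their topic distributions, for example with the Hellinger distance from gensim.matutils. A minimal sketch, where bow_a and bow_b are hypothetical doc2bow vectors for two documents built with the same id2word:

    from gensim.matutils import hellinger

    # Hypothetical: bow_a and bow_b are doc2bow vectors of two documents
    dist_a = lda_model.get_document_topics(bow_a, minimum_probability=0.0)
    dist_b = lda_model.get_document_topics(bow_b, minimum_probability=0.0)

    # 0.0 means identical topic mixtures; values near 1.0 mean very different ones
    print("Hellinger distance:", hellinger(dist_a, dist_b))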