python-3.x, lda

Not very clear about the Python code for the LDA algorithm


I am trying to implement Latent Dirichlet Allocation (LDA) in Python with Gensim. I am following LDA code from a website, but I am still not very clear about it. Could someone who knows LDA explain the code below to me in a lucid manner? I am also uploading the LDA formula, an image from Wikipedia. In this case, LDA is being used to analyze a collection of text documents.

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=4,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=10,
                                            passes=10,
                                            alpha='symmetric',
                                            iterations=100,
                                            per_word_topics=True)
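
For context, corpus and id2word are not defined in the snippet above; a rough sketch (with made-up example documents, not the code from the website) of how they are typically built in Gensim:

from gensim.utils import simple_preprocess
from gensim import corpora

documents = ["I need assistance with a payment on my account",
             "What is the contact rate for this credit card"]

processed_docs = [simple_preprocess(doc) for doc in documents]  # tokenize + lowercase
id2word = corpora.Dictionary(processed_docs)                    # word <-> integer id mapping
corpus = [id2word.doc2bow(doc) for doc in processed_docs]       # bag-of-words counts per document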

[LDA formula from Wikipedia]


Solution

  • LDA is a topic modeler. It takes a corpus that looks like this:

    # the words become numbers and are then counted for frequency
    # consider a random row, 4310 - in this example it has 7 distinct words, and the word indexed 52 shows up 2 times
    # preview the bag of words
    
    bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
    bow_corpus[4310]

    [(3, 1), (13, 1), (37, 1), (38, 1), (39, 1), (50, 1), (52, 2)]

    # same thing in more words
    
    bow_doc_4310 = bow_corpus[4310]
    for i in range(len(bow_doc_4310)):
        print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0], 
                                                   dictionary[bow_doc_4310[i][0]], 
    bow_doc_4310[i][1]))

    Word 3 ("assist") appears 1 time. Word 13 ("payment") appears 1 time. Word 37 ("account") appears 1 time. Word 38 ("card") appears 1 time. Word 39 ("credit") appears 1 time. Word 50 ("contact") appears 1 time. Word 52 ("rate") appears 2 time.

    id2word maps each word in the dictionary to an integer index, so 3 = 'assist'; this is what lets the model print readable topics later. It uses numeric ids because numbers are faster to process than strings. So: sentences are split into words, words are mapped to numbers, the numbers are counted for frequency, and then each word is scored against the other words in the corpus by how often they occur together. The strongest co-occurrences are turned into topics.
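
    As a rough sketch of what that mapping object looks like (the token lists below are made up for illustration, not taken from the original post), the dictionary used above comes from Gensim's Dictionary class:

    from gensim import corpora

    processed_docs = [["assist", "payment", "account"],
                      ["card", "credit", "contact", "rate", "rate"]]

    dictionary = corpora.Dictionary(processed_docs)  # builds the word <-> integer id mapping
    print(dictionary.token2id)                       # e.g. {'account': 0, 'assist': 1, 'payment': 2, ...}
    print(dictionary[1])                             # look up a word by its id, e.g. 'assist'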

    • num_topics is the number of topics you want the model to generate.
    • update_every controls how often the model is updated during online training: with update_every=1 it refreshes its topic estimates after each chunk of documents, while 0 means plain batch learning over the whole corpus.
    • chunksize is how many documents are loaded and processed in each training chunk (10 at a time here).
    • passes is how many full passes the algorithm makes over the corpus - I'd be careful with a higher number; on the Wikipedia corpus mine converged to a single topic after two passes.
    • alpha is the hyper-parameter for the document-topic prior; 'symmetric' uses a fixed value of 1/num_topics, and 0.1 is a common manual choice.
    • iterations caps how many inference iterations are run on each document within a pass (it is not the number of passes over the data set).
    • per_word_topics=True makes the model also keep, for each word, the list of topics that word most likely belongs to, so different topics can end up claiming very different numbers of words - one might dominate 70 words, another 200. When you print topics, 10 words per topic are shown by default, but you can change that.
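
    To see these settings in action, here is a minimal sketch of how you might inspect the trained model (lda_model and bow_corpus are assumed to be the objects from the snippets above):

    # show the topics; 10 words per topic is the default, num_words changes it
    for topic_id, words in lda_model.print_topics(num_topics=4, num_words=10):
        print(topic_id, words)

    # because per_word_topics=True, you can also ask which topics each word
    # in a document was assigned to
    doc_topics, word_topics, phi_values = lda_model.get_document_topics(
        bow_corpus[4310], per_word_topics=True)
    print(doc_topics)    # [(topic_id, probability), ...] for the whole document
    print(word_topics)   # [(word_id, [most likely topic ids]), ...]

    I hope this helps :)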