Tags: python-3.x, nlp, gensim, lda, topic-modeling

How to set time slices - Dynamic Topic Model


Intro

Currently I am using Gensim in combination with pandas and numpy to run document NLP computations. I'd like to build an LDA sequential model to track how our topics change over time, but I am running into errors with the corpus format.

I am trying to figure out how to set time slices for dynamic topic models. I am using LdaSeqModel, which requires integer time slices.

The Data

It's a csv:

import pandas as pd

data = pd.read_csv('CGA Jan17 - Mar19 Time Slice.csv', encoding="ISO-8859-1")
documents = data[['TextForTopics']]
documents['index'] = documents.index

       Month    Year    Begin Date    TextForTopics                                        time_slice
0      march   2017    3/23/2017     request: the caller is requesting an appointme...   1

This is then converted into an array of tuples called the bow_corpus:

[[(12, 2), (25, 1), (30, 1)], [(33, 1), (136, 1), (159, 1), (161, 1)], [(165, 1), (247, 2)], [(326, 1), (354, 1), (755, 1), (821, 1)]]
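For context, a bag-of-words corpus of this shape can be reproduced without gensim. A minimal stand-in sketch (the token lists are invented, and the `vocab` dict plays the role of gensim's `Dictionary`):

```python
from collections import Counter

# invented example documents; in the question these come from TextForTopics
docs = [["caller", "request", "appointment"],
        ["caller", "form"]]

# token -> integer id mapping (the role gensim's Dictionary plays)
vocab = {}
for doc in docs:
    for tok in doc:
        vocab.setdefault(tok, len(vocab))

# each document becomes a sorted list of (token_id, count) tuples
bow_corpus = [sorted((vocab[t], c) for t, c in Counter(doc).items())
              for doc in docs]
print(bow_corpus)  # → [[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1)]]
```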

Desired Output

It should print one topic allocation for each time slice. If I entered 3 topics and two time slices, I should get three topics printed twice, showing how the topics evolved over time.

[(0,
  '0.165*"enrol" + 0.108*"medicar" + 0.051*"form"'),
 (1,
  '0.303*"caller" + 0.290*"inform" + 0.031*"abl"'),
 (2,
  '0.208*"date" + 0.140*"effect" + 0.060*"medicaid"')]
[(0,
  '0.165*"enrol" + 0.108*"cats" + 0.051*"form"'),
 (1,
  '0.303*"caller" + 0.290*"puppies" + 0.031*"abl"'),
 (2,
  '0.208*"date" + 0.140*"elephants" + 0.060*"medicaid"')]

What I've tried

This is the call; bow_corpus is the list of tuple lists shown above:

ldaseq = LdaSeqModel(corpus=bow_corpus, time_slice=[], num_topics=15, chunksize=1)

I've tried every version of integer input for those time slices, and they all produce errors. My premise was that time_slice would represent the number of indices/rows/documents in each time slice. For example, my data has 1.8 million rows; if I wanted two time slices, I would order my data by time and enter integer cutoffs like time_slice = [489234, 1310766]. All inputs produce the error below.
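For what it's worth, that premise is internally consistent. A minimal sketch of the invariant the counts have to satisfy (not gensim API, just arithmetic; the numbers are taken from the paragraph above):

```python
# time_slice is a list of per-slice document counts, and those counts must
# sum to the total number of documents, which are assumed to be sorted
# oldest-to-newest before fitting.
n_docs = 1_800_000                      # rows in the question's csv
time_slice = [489_234, 1_310_766]       # two slices, as proposed above
print(sum(time_slice) == n_docs)        # → True
```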

The Error

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-5-e58059a7fb6f> in <module>
----> 1 ldaseq = LdaSeqModel(corpus=bow_corpus, time_slice=[], num_topics=15, chunksize=1)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in __init__(self, corpus, time_slice, id2word, alphas, num_topics, initialize, sstats, lda_model, obs_variance, chain_variance, passes, random_state, lda_inference_max_iter, em_min_iter, em_max_iter, chunksize)
    186 
    187             # fit DTM
--> 188             self.fit_lda_seq(corpus, lda_inference_max_iter, em_min_iter, em_max_iter, chunksize)
    189 
    190     def init_ldaseq_ss(self, topic_chain_variance, topic_obs_variance, alpha, init_suffstats):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in fit_lda_seq(self, corpus, lda_inference_max_iter, em_min_iter, em_max_iter, chunksize)
    275             # seq model and find the evidence lower bound. This is the E - Step
    276             bound, gammas = \
--> 277                 self.lda_seq_infer(corpus, topic_suffstats, gammas, lhoods, iter_, lda_inference_max_iter, chunksize)
    278             self.gammas = gammas
    279 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in lda_seq_infer(self, corpus, topic_suffstats, gammas, lhoods, iter_, lda_inference_max_iter, chunksize)
    351             bound, gammas = self.inferDTMseq(
    352                 corpus, topic_suffstats, gammas, lhoods, lda,
--> 353                 ldapost, iter_, bound, lda_inference_max_iter, chunksize
    354             )
    355         elif model == "DIM":

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in inferDTMseq(self, corpus, topic_suffstats, gammas, lhoods, lda, ldapost, iter_, bound, lda_inference_max_iter, chunksize)
    401         time = 0  # current time-slice
    402         doc_num = 0  # doc-index in current time-slice
--> 403         lda = self.make_lda_seq_slice(lda, time)  # create lda_seq slice
    404 
    405         time_slice = np.cumsum(np.array(self.time_slice))

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in make_lda_seq_slice(self, lda, time)
    459         """
    460         for k in range(self.num_topics):
--> 461             lda.topics[:, k] = self.topic_chains[k].e_log_prob[:, time]
    462 
    463         lda.alpha = np.copy(self.alphas)

IndexError: index 0 is out of bounds for axis 1 with size 0

Solutions

I tried going back to the documentation and comparing the format of the common_corpus used as an example with the format of my bow_corpus; they are the same. I also tried running the code from the documentation to see how it worked, but it produced the same error. I'm not sure the problem is in my code anymore, but I hope it is.

I've also tried changing the file format by manually dividing my csv into 9 csvs, one per time slice, and creating an iterated corpus from those, but that didn't work. I've considered converting each row of my csv into a txt file and then building a corpus from those, as David Beil does, but that seems pointlessly tedious since I already have an iterated corpus.


Solution

  • I'm going to assume you are working in a single dataframe. Let's say you want to use years as your unit of time.

    1. For time_slice to work properly with ldaseqmodel you need to first order your dataframe ascending, i.e. from oldest to newest.
    2. Create a time_slice variable so you can later feed it back into the model:
    import numpy as np
    uniqueyears, time_slices = np.unique(data.Year, return_counts=True)
    # takes all unique values in data.Year, along with how often they occur,
    # and returns them as arrays

    print(np.asarray((uniqueyears, time_slices)).T)
    # see what you've made; technically you don't need this
    

    returns (using example data)

    [[1992   28]
     [1993   18]
     [1994   25]
     [1995   18]
     [1996   44]
     [1997   38]
     [1998   30]]
    

    This works for years. If you want to go more fine-grained, you can adapt the same concept, as long as you get the ordering of the documents right (that ordering is how gensim connects them to time slices). For example, to take monthly slices you could rewrite the dates as 20173 for March 2017 and 20174 for April 2017. Really, any grain will do as long as you can identify which documents belong to the same slice.
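Putting the steps together, a minimal end-to-end sketch on a toy dataframe (the column names `Year` and `TextForTopics` follow the question; the `LdaSeqModel` fit itself is left commented out because fitting is expensive and needs the bow_corpus built as in the question):

```python
import numpy as np
import pandas as pd

# toy stand-in for the question's dataframe
data = pd.DataFrame({
    "Year": [2018, 2017, 2018, 2017, 2018],
    "TextForTopics": ["enrol medicare", "request appointment",
                      "date effect", "caller form", "caller inform"],
})

# 1. sort ascending so document order matches the time slices
data = data.sort_values("Year").reset_index(drop=True)

# 2. per-year document counts become the time_slice argument
uniqueyears, time_slices = np.unique(data.Year, return_counts=True)
print(time_slices.tolist())  # → [2, 3]

# 3. feed the counts back into the model
# ldaseq = LdaSeqModel(corpus=bow_corpus, time_slice=time_slices.tolist(),
#                      num_topics=15, chunksize=1)
```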