Intro
Currently I am using Gensim together with pandas and numpy for document NLP computation. I'd like to build an LDA sequential model to track how our topics change over time, but I am running into errors with the corpus format.
I am trying to figure out how to set time slices for dynamic topic models. I am using LdaSeqModel, which requires an integer time_slice.
The Data
It's a csv:
data = pd.read_csv('CGA Jan17 - Mar19 Time Slice.csv', encoding="ISO-8859-1")
documents = data[['TextForTopics']]
documents['index'] = documents.index
   Month  Year  Begin Date                                      TextForTopics  time_slice
0  march  2017   3/23/2017  request: the caller is requesting an appointme...          1
This is then converted into a list of lists of (token_id, count) tuples called the bow_corpus:
[[(12, 2), (25, 1), (30, 1)], [(33, 1), (136, 1), (159, 1), (161, 1)], [(165, 1), (247, 2)], [(326, 1), (354, 1), (755, 1), (821, 1)]]
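For reference, each inner list is one bag-of-words document: (token_id, count) pairs, which is what gensim's Dictionary.doc2bow produces. A minimal stdlib-only sketch of how that format arises (the tokens and documents here are made up for illustration):

```python
from collections import Counter

# Hypothetical toy documents, already tokenized.
docs = [["caller", "request", "caller"],
        ["form", "medicare", "form", "form"]]

# Assign an integer id to every distinct token, as gensim's Dictionary does.
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

# Convert each document to sorted (token_id, count) pairs -- the bow format.
bow_corpus = [sorted((vocab[t], c) for t, c in Counter(d).items()) for d in docs]
print(bow_corpus)  # [[(0, 2), (3, 1)], [(1, 3), (2, 1)]]
```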
Desired Output
It should print one topic allocation per time slice. If I entered three topics and two time slices, I should get the three topics printed twice, showing how the topics evolved over time.
[(0,
  '0.165*"enrol" + 0.108*"medicar" + 0.051*"form"'),
 (1,
  '0.303*"caller" + 0.290*"inform" + 0.031*"abl"'),
 (2,
  '0.208*"date" + 0.140*"effect" + 0.060*"medicaid"')]
[(0,
  '0.165*"enrol" + 0.108*"cats" + 0.051*"form"'),
 (1,
  '0.303*"caller" + 0.290*"puppies" + 0.031*"abl"'),
 (2,
  '0.208*"date" + 0.140*"elephants" + 0.060*"medicaid"')]
What I've tried
This is the call; bow_corpus is the list of (token_id, count) tuples shown above:
ldaseq = LdaSeqModel(corpus=bow_corpus, time_slice=[], num_topics=15, chunksize=1)
I've tried every kind of integer input for time_slice and they all produce errors. My premise was that time_slice represents the number of indices/rows/documents in each time slice. For example, my data has 1.8 million rows; if I wanted two time slices, I would order my data by time and enter integer cutoffs like time_slice = [489234, 1310766]. All inputs produce this error:
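That premise is correct: time_slice is a list of per-slice document counts, and their sum must equal the number of documents in the (time-ordered) corpus. A minimal stdlib sketch of that invariant, with made-up per-document years:

```python
from collections import Counter

# Hypothetical per-document years, one entry per document, sorted ascending.
doc_years = [2017, 2017, 2017, 2018, 2018, 2019]

# Count documents per year, in chronological order.
counts = Counter(doc_years)
time_slice = [counts[y] for y in sorted(counts)]
print(time_slice)  # [3, 2, 1]

# The invariant LdaSeqModel relies on: the counts cover every document exactly once.
assert sum(time_slice) == len(doc_years)
```

With such a list, the call would look like LdaSeqModel(corpus=bow_corpus, time_slice=time_slice, num_topics=15). Note that the call in the traceback below passes time_slice=[], i.e. zero time slices, which plausibly explains the "index 0 is out of bounds for axis 1 with size 0" failure: there is no time axis to index into.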
The Error
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-5-e58059a7fb6f> in <module>
----> 1 ldaseq = LdaSeqModel(corpus=bow_corpus, time_slice=[], num_topics=15, chunksize=1)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in __init__(self, corpus, time_slice, id2word, alphas, num_topics, initialize, sstats, lda_model, obs_variance, chain_variance, passes, random_state, lda_inference_max_iter, em_min_iter, em_max_iter, chunksize)
186
187 # fit DTM
--> 188 self.fit_lda_seq(corpus, lda_inference_max_iter, em_min_iter, em_max_iter, chunksize)
189
190 def init_ldaseq_ss(self, topic_chain_variance, topic_obs_variance, alpha, init_suffstats):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in fit_lda_seq(self, corpus, lda_inference_max_iter, em_min_iter, em_max_iter, chunksize)
275 # seq model and find the evidence lower bound. This is the E - Step
276 bound, gammas = \
--> 277 self.lda_seq_infer(corpus, topic_suffstats, gammas, lhoods, iter_, lda_inference_max_iter, chunksize)
278 self.gammas = gammas
279
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in lda_seq_infer(self, corpus, topic_suffstats, gammas, lhoods, iter_, lda_inference_max_iter, chunksize)
351 bound, gammas = self.inferDTMseq(
352 corpus, topic_suffstats, gammas, lhoods, lda,
--> 353 ldapost, iter_, bound, lda_inference_max_iter, chunksize
354 )
355 elif model == "DIM":
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in inferDTMseq(self, corpus, topic_suffstats, gammas, lhoods, lda, ldapost, iter_, bound, lda_inference_max_iter, chunksize)
401 time = 0 # current time-slice
402 doc_num = 0 # doc-index in current time-slice
--> 403 lda = self.make_lda_seq_slice(lda, time) # create lda_seq slice
404
405 time_slice = np.cumsum(np.array(self.time_slice))
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\ldaseqmodel.py in make_lda_seq_slice(self, lda, time)
459 """
460 for k in range(self.num_topics):
--> 461 lda.topics[:, k] = self.topic_chains[k].e_log_prob[:, time]
462
463 lda.alpha = np.copy(self.alphas)
IndexError: index 0 is out of bounds for axis 1 with size 0
Solutions
I tried going back to the documentation and comparing the format of the common_corpus used as an example; the format of my bow_corpus is the same. I also tried running the code from the documentation to see how it worked, but it produced the same error. I'm not sure whether the problem is my code anymore, but I hope it is.
I've also tried messing with the file format by manually dividing my csv into 9 csvs containing my time slices and creating an iterated corpus out of those, but that didn't work. I've considered converting each row of my csv into a txt file and then creating a corpus out of those, like David Beil does, but that seems pointlessly tedious since I already have an iterated corpus.
I'm going to assume you are working in a single dataframe. Let's say you want to use years as your unit of time.
For time_slice to work properly with LdaSeqModel, you need to first order your dataframe ascending, i.e. from oldest to newest. Then:
import numpy as np
uniqueyears, time_slices = np.unique(data.Year, return_counts=True)
# takes all unique values in data.Year, and how often each occurs, as arrays
print(np.asarray((uniqueyears, time_slices)).T)
# see what you've made; technically you don't need this
returns (using example data)
[[1992 28]
[1993 18]
[1994 25]
[1995 18]
[1996 44]
[1997 38]
[1998 30]]
This works for years; if you want to go more fine-grained, you can adapt the same concept, as long as you get the ordering of the documents right (that ordering is how gensim connects documents to time slices). For example, for monthly slices you could rewrite the dates as sortable year-month keys such as 201703 for March 2017 and 201704 for April 2017. Really, any grain will do as long as you can identify documents as belonging to the same slice.
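A minimal stdlib sketch of the monthly variant (the dates here are made up; with a real dataframe you would build the keys from the Begin Date column instead):

```python
from collections import Counter
from datetime import datetime

# Hypothetical document dates, one per document.
dates = ["3/23/2017", "3/30/2017", "4/02/2017", "4/15/2017", "5/01/2017"]

# Zero-padded YYYYMM keys sort correctly both as strings and as integers
# (an unpadded scheme like 20173 would misorder, e.g. 20181 < 201712).
keys = [datetime.strptime(d, "%m/%d/%Y").strftime("%Y%m") for d in dates]

# Sort documents by key, then count documents per month in chronological order.
keys.sort()
counts = Counter(keys)
time_slice = [counts[k] for k in sorted(counts)]
print(time_slice)  # [2, 2, 1]
```

These counts can then be passed as time_slice to LdaSeqModel, provided bow_corpus is built from the documents in that same sorted order.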