Implementing Topic Model with Python (numpy)


Recently, I implemented Gibbs sampling for the LDA topic model in Python using numpy, taking some code from a site as a reference. In each iteration of Gibbs sampling, we remove one (current) word, sample a new topic for that word according to a posterior conditional probability distribution inferred from the LDA model, and update the word-topic counts, as follows:

for m, doc in enumerate(docs): #m: doc id
  for n, t in enumerate(doc): #n: id of word inside document, t: id of the word globally
    # discount counts for word t with associated topic z
    z = z_m_n[m][n]
    n_m_z[m][z] -= 1
    n_z_t[z, t] -= 1 
    n_z[z] -= 1
    n_m[m] -= 1

    # sample a new topic from the multinomial conditional
    p_z_left = (n_z_t[:, t] + beta) / (n_z + V * beta)
    p_z_right = (n_m_z[m] + alpha) / ( n_m[m] + alpha * K)
    p_z = p_z_left * p_z_right
    p_z /= numpy.sum(p_z)
    new_z = numpy.random.multinomial(1, p_z).argmax() 

    # set z as the new topic and increment counts
    z_m_n[m][n] = new_z
    n_m_z[m][new_z] += 1
    n_z_t[new_z, t] += 1
    n_z[new_z] += 1
    n_m[m] += 1

In the above code, we sample a new (single) z with numpy's multinomial function.
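To make the single draw concrete, here is a minimal sketch of that step in isolation. The probability vector and the seed are made-up values for illustration only; `rng.choice` is shown as an alternative that returns the sampled index directly instead of a one-hot vector.

```python
import numpy

rng = numpy.random.default_rng(0)   # seed chosen only for reproducibility
p_z = numpy.array([0.2, 0.5, 0.3])  # hypothetical normalized topic probabilities

# One-hot draw followed by argmax, as in the loop above
one_hot = rng.multinomial(1, p_z)
new_z = int(one_hot.argmax())

# Alternative: draw the topic index directly
new_z_alt = int(rng.choice(len(p_z), p=p_z))
```

Both forms draw one index from the same categorical distribution, so either can be used inside the sampler.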

Now, I want to implement the Joint Sentiment Topic model from this paper. For that, I would need the following structures to keep track of the needed counts:

3D matrix containing # occurrences for a word for each topic, for each sentiment
3D matrix containing # occurrences for a topic, for each sentiment, for each document
2D matrix containing # occurrences for a topic, for each sentiment
2D matrix containing # occurrences for a sentiment for each document
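As a rough sketch, the four count structures listed above could be allocated as numpy arrays along these lines. The sizes, the array names, and the `add_token` helper are all hypothetical and only meant to show the shapes involved; the axis order is one possible choice.

```python
import numpy

# Hypothetical sizes: vocabulary, topics, sentiments, documents
V, T, S, D = 5, 3, 2, 4

n_word_topic_sent = numpy.zeros((V, T, S), dtype=int)  # word count per (topic, sentiment)
n_topic_sent_doc = numpy.zeros((T, S, D), dtype=int)   # (topic, sentiment) count per document
n_topic_sent = numpy.zeros((T, S), dtype=int)          # total count per (topic, sentiment)
n_sent_doc = numpy.zeros((S, D), dtype=int)            # sentiment count per document

def add_token(word, topic, sent, doc, delta=1):
    """Add (delta=1) or remove (delta=-1) one word token from every count array."""
    n_word_topic_sent[word, topic, sent] += delta
    n_topic_sent_doc[topic, sent, doc] += delta
    n_topic_sent[topic, sent] += delta
    n_sent_doc[sent, doc] += delta

# Register one token: word 2 in document 3, assigned topic 1 and sentiment 0
add_token(word=2, topic=1, sent=0, doc=3)
```

Routing every update through a single helper keeps the four arrays consistent, since each token has to be counted in all of them.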

And now comes the problem: in this Gibbs sampler, for each word seen in a document, both a new topic and a sentiment label are sampled from a conditional posterior (page 4, equation 5 of the paper). How could I "sample those 2 values" in Python?

Thanks in advance...


Solution

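  • Try this. Sampling from a joint distribution over topics and sentiment labels just means that the entire T x S matrix should sum to 1. You can then flatten that matrix, draw a single index from it, and convert the index back into a (topic, sentiment) pair.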

    import numpy

    docs=[[0,1],[0,0],[1,0,1]]
    D=len(docs)
    # topic (z) and sentiment (l) assignment for every word, initialised to 0
    z_d_n=[[0 for _ in range(len(d))] for d in docs]
    l_d_n=[[0 for _ in range(len(d))] for d in docs]
    
    V=2
    T=2
    S=2
    n_m_j_k=numpy.zeros( (V,T,S) )
    n_j_k_d=numpy.zeros( (T,S,D) )
    n_j_k=numpy.zeros( (T,S) )
    n_k_d=numpy.zeros( (S,D) )
    n_d=numpy.zeros( (D) )
    
    beta=.1
    alpha=.1
    gamma=.1
    
    for d, doc in enumerate(docs): #d: doc id
        for n, m in enumerate(doc): #n: index of the word inside document, m: id of the word in the vocabulary
            # j is the topic
            j = z_d_n[d][n]
            # k is the sentiment
            k = l_d_n[d][n]
            n_m_j_k[m][j][k] += 1
            n_j_k_d[j][k][d] += 1
            n_j_k[j][k] += 1
            n_k_d[k][d] += 1
            n_d[d] += 1 
    
    for d, doc in enumerate(docs): #d: doc id
        for n, m in enumerate(doc): #n: index of the word inside document, m: id of the word in the vocabulary
            # j is the topic
            j = z_d_n[d][n]
            # k is the sentiment
            k = l_d_n[d][n]
            n_m_j_k[m][j][k] -= 1
            n_j_k_d[j][k][d] -= 1
            n_j_k[j][k] -= 1
            n_k_d[k][d] -= 1
            n_d[d] -= 1 
    
            # sample a new topic and sentiment label jointly
            # T is the number of topics
            # S is the number of sentiments
            p_left = (n_m_j_k[m] + beta) / (n_j_k + V * beta) # T x S array
            p_mid = (n_j_k_d[:,:,d] + alpha) / numpy.tile(n_k_d[:,d] + T * alpha, (T,1) )
            p_right = numpy.tile(n_k_d[:,d] + gamma,(T,1)) /  numpy.tile(n_d[d] + S * gamma,(T,S))
            p = p_left * p_mid * p_right
            p /= numpy.sum(p)
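            # p is flattened in row-major order, so the flat index equals j * S + k
            new_jk = numpy.random.multinomial(1, numpy.reshape(p, (T*S) )).argmax()
            j=new_jk//S
            k=new_jk%S
    
            # store the new assignment and restore every count that was decremented
            z_d_n[d][n]=j
            l_d_n[d][n]=k
            n_m_j_k[m][j][k] += 1
            n_j_k_d[j][k][d] += 1
            n_j_k[j][k] += 1
            n_k_d[k][d] += 1
            n_d[d] += 1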
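
The joint draw can also be looked at on its own. The following is a minimal sketch, assuming a made-up T x S weight matrix and an arbitrary seed: it flattens the matrix, samples one flat index, and lets `numpy.unravel_index` recover the (topic, sentiment) pair so the index arithmetic does not have to be written by hand.

```python
import numpy

rng = numpy.random.default_rng(0)  # seed chosen only for reproducibility
T, S = 3, 2                        # hypothetical numbers of topics and sentiments

# Hypothetical unnormalized T x S weights, normalized so the whole matrix sums to 1
p = rng.random((T, S))
p /= p.sum()

# Draw one index from the flattened (row-major) joint distribution
flat_index = int(rng.choice(T * S, p=p.ravel()))

# Map the flat index back to a (topic, sentiment) pair
j, k = numpy.unravel_index(flat_index, (T, S))
j, k = int(j), int(k)
```

With row-major flattening the flat index is `j * S + k`, which is why the divisor and the modulus are both the number of sentiments rather than the number of topics.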