
Restore original document id from lda object


I'm trying to compare the "consensus" topic prediction derived from a document's terms (beta) against the most likely topic predicted for the document itself (gamma), using functions from topicmodels. Extracting the most likely topic per document is easy: group_by() on document, then top_n() on gamma. The beta estimate, however, carries no document id. Its output contains only three columns (topic, term, beta), so I cannot derive the term-based "consensus" topic for a given document.

Using my own data as an example:

Sys.setlocale("LC_ALL","Chinese")  # reset to simplified Chinese encoding as the text data is in Chinese
library(foreign)
library(plyr)   # load plyr before dplyr so it does not mask dplyr's verbs
library(dplyr)
library(tidyverse)
library(tidytext)
library(tm)
library(topicmodels)

sample_dtm <- readRDS(gzcon(url("https://www.dropbox.com/s/gznqlncd9psx3wz/sample_dtm.rds?dl=1")))

lda_out <- LDA(sample_dtm, k = 2, control = list(seed = 1234))

word_topics <- tidy(lda_out, matrix = "beta")

head(word_topics, n = 4)
# A tibble: 4 x 3
  topic term      beta
  <int> <chr>    <dbl>
1     1 费解  8.49e- 4
2     2 费解  1.15e- 9
3     1 上    2.92e- 3

document_gamma <- tidy(lda_out, matrix = "gamma")

head(document_gamma, n = 4)
# A tibble: 4 x 3
  document topic   gamma
  <chr>    <int>   <dbl>
1 1203232      1 0.00374
2 529660       1 0.0329 
3 738921       1 0.00138
4 963374       1 0.302

Is there any way to restore the document id from the LDA output and combine it with the beta estimates (word_topics, which is stored as a tibble)? That would make it much easier to compare the topic estimated from the consensus of beta with the one estimated from gamma.


Solution

  • If I am understanding you correctly, I believe the function you want is augment(), which returns a table with one row per original document-term pair, connected to topics.

    Sys.setlocale("LC_ALL","Chinese")  # reset to simplified Chinese encoding as the text data is in Chinese
    #> Warning in Sys.setlocale("LC_ALL", "Chinese"): OS reports request to set
    #> locale to "Chinese" cannot be honored
    #> [1] ""
    library(foreign)
    library(dplyr)
    library(plyr)
    #> -------------------------------------------------------------------------
    #> You have loaded plyr after dplyr - this is likely to cause problems.
    #> If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
    #> library(plyr); library(dplyr)
    #> -------------------------------------------------------------------------
    #> 
    #> Attaching package: 'plyr'
    #> The following objects are masked from 'package:dplyr':
    #> 
    #>     arrange, count, desc, failwith, id, mutate, rename, summarise,
    #>     summarize
    library(tidyverse)
    library(tidytext)
    library(tm)
    library(topicmodels)
    
    sample_dtm <- readRDS(gzcon(url("https://www.dropbox.com/s/gznqlncd9psx3wz/sample_dtm.rds?dl=1")))
    
    lda_out <- LDA(sample_dtm, k = 2, control = list(seed = 1234))
    
    augment(lda_out, sample_dtm)
    #> # A tibble: 18,676 x 4
    #>    document term     count .topic
    #>    <chr>    <chr>    <dbl>  <dbl>
    #>  1 649      作揖         1      1
    #>  2 649      拳头         1      1
    #>  3 649      赞           1      1
    #>  4 656      住           1      1
    #>  5 656      小区         1      1
    #>  6 656      没           1      1
    #>  7 656      注意         2      1
    #>  8 1916     中国         1      1
    #>  9 1916     中国台湾     1      1
    #> 10 1916     反对         1      1
    #> # … with 18,666 more rows
    

    Created on 2019-06-04 by the reprex package (v0.2.1)
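
    To get from the augment() output to the comparison asked about in the question, one option is to tally the per-term assignments within each document and set the winner against the top gamma topic. The following is only a sketch under stated assumptions: toy_counts is a made-up corpus (not the asker's data), "consensus" is taken to mean the topic covering the most word occurrences, and the names consensus_topic, gamma_topic and agree are illustrative.

    ```r
    library(dplyr)
    library(tidytext)
    library(topicmodels)

    # Hypothetical toy corpus, used only so the example is self-contained
    toy_counts <- tibble::tribble(
      ~document, ~term,      ~count,
      "d1",      "goal",     4,
      "d1",      "match",    3,
      "d1",      "team",     2,
      "d2",      "team",     3,
      "d2",      "goal",     2,
      "d2",      "vote",     1,
      "d3",      "vote",     4,
      "d3",      "party",    3,
      "d3",      "election", 2,
      "d4",      "party",    2,
      "d4",      "election", 3,
      "d4",      "match",    1
    )
    toy_dtm <- cast_dtm(toy_counts, document, term, count)
    toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

    # Term-level view: the topic that covers the most word occurrences per document.
    # dplyr::count() is namespaced because plyr, if attached later, masks count().
    consensus <- augment(toy_lda, data = toy_dtm) %>%
      dplyr::count(document, .topic, wt = count, name = "n_words") %>%
      group_by(document) %>%
      slice_max(n_words, n = 1, with_ties = FALSE) %>%
      ungroup() %>%
      select(document, consensus_topic = .topic)

    # Document-level view: the topic with the highest gamma per document
    top_gamma <- tidy(toy_lda, matrix = "gamma") %>%
      group_by(document) %>%
      slice_max(gamma, n = 1, with_ties = FALSE) %>%
      ungroup() %>%
      select(document, gamma_topic = topic, gamma)

    # One row per document, flagging whether the two views agree
    comparison <- consensus %>%
      inner_join(top_gamma, by = "document") %>%
      mutate(agree = consensus_topic == gamma_topic)
    ```

    Weighting by count makes a term that appears three times count three times. Dropping wt = count would instead give each distinct term one vote, which may or may not be closer to what is meant by "consensus" here.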

    This connects the document ID from the LDA model to the topics. It sounds like you already understand this, but just to reiterate:

    • the beta matrix is word-topic probabilities
    • the gamma matrix is document-topic probabilities
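
    A quick way to see the difference between the two matrices is to check how each one is normalized. The sketch below uses a small made-up corpus (toy_counts is hypothetical, not the asker's data) and assumes the default tidy() output columns shown in the question.

    ```r
    library(dplyr)
    library(tidytext)
    library(topicmodels)

    # Hypothetical toy corpus, used only so the example is self-contained
    toy_counts <- tibble::tribble(
      ~document, ~term,   ~count,
      "d1",      "goal",  4,
      "d1",      "team",  2,
      "d2",      "team",  3,
      "d2",      "vote",  1,
      "d3",      "vote",  4,
      "d3",      "party", 3
    )
    toy_dtm <- cast_dtm(toy_counts, document, term, count)
    toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

    # beta is P(term | topic): within each topic, beta sums to 1 over all terms
    beta_sums <- tidy(toy_lda, matrix = "beta") %>%
      group_by(topic) %>%
      summarise(total = sum(beta))

    # gamma is P(topic | document): within each document, gamma sums to 1 over topics
    gamma_sums <- tidy(toy_lda, matrix = "gamma") %>%
      group_by(document) %>%
      summarise(total = sum(gamma))
    ```

    Because beta is defined per topic and per term only, its tidy output has no document column to restore. The document id comes back in only when beta is combined with the document-term data, which is what augment() does.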