LDA Returning numbers instead of words from Term Document Matrix


I am trying to use the LDA function to evaluate a corpus of text in R. However, when I do so, it seems to use the row names of the observations rather than the actual words in the corpus. I can't find anything else about this online so I imagine I must be doing something very basic incorrectly.

library(tm)
library(SnowballC)
library(tidytext)
library(stringr)
library(tidyr)
library(topicmodels)
library(dplyr)

#read in data
data <- read.csv('CSV_format_data.csv',sep=',')
#Create corpus/DTM
interviews <- as.matrix(data[,2])
ints.corpus <- Corpus(VectorSource(interviews))
ints.dtm <- TermDocumentMatrix(ints.corpus)

chapters_lda <- LDA(ints.dtm, k = 4, control = list(seed = 5421685))
chapters_lda_td <- tidy(chapters_lda,matrix="beta")
chapters_lda_td

head(ints.dtm$dimnames$Terms)

The 'chapters_lda_td' command outputs

# A tibble: 4,084 x 3
   topic term        beta
   <int> <chr>      <dbl>
 1     1 1     0.000555  
 2     2 1     0.00399   
 3     3 1     0.000614  
 4     4 1     0.000699  
 5     1 2     0.0000195 
 6     2 2     0.000708  
 7     3 2     0.000731  
 8     4 2     0.00000155
 9     1 3     0.000974  
10     2 3     0.0000363 
# ... with 4,074 more rows

Note that the "term" column contains numbers where there should be words. The row count (4,084) matches the number of documents times the number of topics, rather than the number of terms times the number of topics, as it should. I ran 'head(ints.dtm$dimnames$Terms)' to check that the DTM actually contains words, which it does. The result is:

[1] "aaye"      "able"      "adjust"    "admission" "after"     "age" 

The data file itself is a pretty standard two-column CSV file with an ID and a block of text, and hasn't given me any problem while doing other text-mining stuff with it and the tm package. Any help would be appreciated, thank you!


Solution

  • I figured it out! It is because I am using the command

    ints.dtm <- TermDocumentMatrix(ints.corpus)
    

    rather than

    ints.dtm <- DocumentTermMatrix(ints.corpus)
    

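    LDA expects documents in rows and terms in columns. A TermDocumentMatrix is the transpose of that (terms in rows, documents in columns), so LDA treats each term as a document and takes the column names, which here are the document IDs, as its vocabulary. That is why numbers show up in the "term" column.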
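
    For anyone who wants to see the orientation difference directly, here is a minimal sketch using only the tm package. The three toy documents and the variable names (docs, tdm, dtm, dtm_from_tdm) are made up for illustration and are not from my data. It also shows that an existing TermDocumentMatrix can be transposed with t() instead of being rebuilt.

```r
library(tm)

# Toy documents standing in for the interview text (hypothetical data)
docs <- c("the student was able to adjust after admission",
          "age and admission rules changed after the review",
          "the review noted students adjust with age")

corpus <- Corpus(VectorSource(docs))

tdm <- TermDocumentMatrix(corpus)  # rows = terms, columns = documents
dtm <- DocumentTermMatrix(corpus)  # rows = documents, columns = terms

# Compare the shapes: one is the transpose of the other
dim(tdm)
dim(dtm)

# The column names are what end up in the "term" column after LDA
head(colnames(tdm))  # document IDs
head(colnames(dtm))  # actual words

# An existing TermDocumentMatrix can be transposed instead of rebuilt
dtm_from_tdm <- t(tdm)
class(dtm_from_tdm)
```

    Passing dtm (or dtm_from_tdm) to LDA() in place of the TermDocumentMatrix should then give words in the "term" column.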