LDA TopicModels producing list of numbers rather than terms


Bear with me as I am extremely new to this and working on a project for a course in a certificate program.

I have a .csv dataset that I obtained by retrieving bibliometric records from the PubMed and Embase databases. There are 1034 rows and several columns, but I am trying to create topic models from just one column, the Abstract column. Some records do not have an abstract. I've done some processing (removing stopwords, punctuation, etc.) and have been able to make a bar plot of words occurring more than 200 times, create a frequent-term list by rank, and run word associations with selected words. So it seems R is seeing the words themselves in the Abstract field. My issue comes when I try to create topic models using the topicmodels package. Here's the bit of code I'm using.

#including 1st 3 lines for reference
options(header = FALSE, stringsAsFactors = FALSE, FileEncoding = "latin1")
records <- read.csv("Combined.csv")
AbstractCorpus <- Corpus(VectorSource(records$Abstract))

AbstractTDM <- TermDocumentMatrix(AbstractCorpus)
library(topicmodels)
library(lda)
lda <- LDA(AbstractTDM, k = 8)
(term <- terms(lda, 6))
term <- (apply(term, MARGIN = 2, paste, collapse = ","))

However, the output of topics I get is the following.

     Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
[1,] "499"   "733"   "390"   "833"   "17"    "413"   "719"   "392"  
[2,] "484"   "655"   "808"   "412"   "550"   "881"   "721"   "61"   
[3,] "857"   "299"   "878"   "909"   "15"    "258"   "47"    "164"  
[4,] "491"   "672"   "313"   "1028"  "126"   "55"    "375"   "987"  
[5,] "734"   "430"   "405"   "102"   "13"    "193"   "83"    "588"  
[6,] "403"   "52"    "489"   "10"    "598"   "52"    "933"   "980"  

Why am I seeing numbers here rather than words?

Furthermore, the following code, which I basically took from the R PDF on topicmodels, does produce values for me, but the topics are still numbers rather than words, which is meaningless to me.

#using information from topicmodels paper
library(tm)
library(topicmodels)
library(lda)
AbstractTM <- list(
  VEM = LDA(AbstractTDM, k = 10, control = list(seed = 505)),
  VEM_fixed = LDA(AbstractTDM, k = 10,
                  control = list(estimate.alpha = FALSE, seed = 505)),
  Gibbs = LDA(AbstractTDM, k = 10, method = "Gibbs",
              control = list(seed = 505, burnin = 100, thin = 10, iter = 100)),
  CTM = CTM(AbstractTDM, k = 10,
            control = list(seed = 505, var = list(tol = 10^-4),
                           em = list(tol = 10^-3))))
#To compare the fitted models we first investigate the α values of the
#models fitted with VEM and α estimated and with VEM and α fixed

sapply(AbstractTM[1:2], slot, "alpha")

#Find entropy 
sapply(AbstractTM, function(x)mean(apply(posterior(x)$topics, 1, 
function(z) - sum(z * log(z)))))

#Find estimated topics and terms
Topic <- topics(AbstractTM[["VEM"]], 1)
Topic
#find 5 most frequent terms for each topic
Terms <- terms(AbstractTM[["VEM"]], 5)
Terms[,1:5]

Any thoughts on what the issue might be?


Solution

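  • According to the topicmodels documentation, LDA() expects a DocumentTermMatrix (documents in rows, terms in columns), not a TermDocumentMatrix. Given a TermDocumentMatrix, it treats each document as a term, so the "terms" it reports are most likely document numbers; that would explain why none of them exceeds your 1034 rows. Try creating the matrix with DocumentTermMatrix(AbstractCorpus) and passing that to LDA() and CTM() instead.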
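
A minimal sketch of that fix, in case it helps. The toy abstracts vector is a hypothetical stand-in for records$Abstract (it is not your data), and k = 2 is chosen only because the toy corpus is tiny. The filtering steps are an assumption about what your file needs: LDA() refuses documents that contain no terms, and you mentioned that some records have no abstract.

```r
library(tm)
library(topicmodels)

# Hypothetical stand-in for records$Abstract; the NA and "" entries
# mimic records that have no abstract.
abstracts <- c(
  "gene expression tumor cells cancer therapy",
  "cancer tumor therapy patients survival",
  NA,
  "exercise diet weight obesity adults",
  "diet obesity exercise children weight",
  ""
)

# Keep only records that actually contain text.
abstracts <- abstracts[!is.na(abstracts) & nzchar(trimws(abstracts))]

corpus <- Corpus(VectorSource(abstracts))

# Documents in rows, terms in columns: the orientation LDA() expects.
dtm <- DocumentTermMatrix(corpus)

# Drop any document left with zero terms after preprocessing,
# since LDA() stops with an error on all-zero rows.
dtm <- dtm[slam::row_sums(dtm) > 0, ]

fit <- LDA(dtm, k = 2, control = list(seed = 505))

# Top 3 terms per topic; these should now be words, not numbers.
top_terms <- terms(fit, 3)
print(top_terms)
```

If you would rather keep the TermDocumentMatrix you already built, transposing it with t(AbstractTDM) should give the same document-by-term orientation.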