r nlp text-mining lda topic-modeling

Topic label of each document in LDA model using textmineR


I'm using textmineR to fit an LDA model to documents, following the vignette at https://cran.r-project.org/web/packages/textmineR/vignettes/c_topic_modeling.html. Is it possible to get the topic label for each document in the data set?

    library(textmineR)
    data(nih_sample)

    # create a document term matrix
    dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
                     doc_names = nih_sample$APPLICATION_ID,
                     ngram_window = c(1, 2),
                     stopword_vec = c(stopwords::stopwords("en"),
                                      stopwords::stopwords(source = "smart")),
                     lower = TRUE,
                     remove_punctuation = TRUE,
                     remove_numbers = TRUE,
                     verbose = FALSE,
                     cpus = 2)

    # keep only terms that appear more than twice
    dtm <- dtm[, colSums(dtm) > 2]

    set.seed(123)
    model <- FitLdaModel(dtm = dtm, k = 20, iterations = 200, burnin = 180,
                         alpha = 0.1, beta = 0.05, optimize_alpha = TRUE,
                         calc_likelihood = TRUE, calc_coherence = TRUE,
                         calc_r2 = TRUE, cpus = 2)

Then I add the labels to the model:

    model$labels <- LabelTopics(assignments = model$theta > 0.05, dtm = dtm, M = 1)

Now I want the topic label for each of the 100 documents in nih_sample$ABSTRACT_TEXT.


Solution

  • Are you looking to label each document by the label of its most prevalent topic? If so, this is how you could do it:

    # convert labels to a data frame so we can merge 
    label_df <- data.frame(topic = rownames(model$labels), label = model$labels, stringsAsFactors = FALSE)
    
    # get the top topic for each document
    top_topics <- apply(model$theta, 1, function(x) names(x)[which.max(x)][1])
    
    # convert the top topics for each document to a data frame so we can merge
    top_topics <- data.frame(document = names(top_topics), top_topic = top_topics, stringsAsFactors = FALSE)
    
    # merge together. Now each document has a label from its top topic
    top_topics <- merge(top_topics, label_df, by.x = "top_topic", by.y = "topic", all.x = TRUE)
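
Note that merge() sorts its result by the join column by default, so the rows above come back ordered by topic rather than in the original document order. If you want to keep the document order, a lookup by topic name is one alternative. The following is a minimal sketch in base R only; the `theta` and `labels` objects are small made-up stand-ins for `model$theta` and `model$labels`, and the document names, topic names, and label strings are hypothetical.

```r
# Toy stand-ins (hypothetical values) for model$theta and model$labels
theta <- matrix(
  c(0.70, 0.20, 0.10,
    0.05, 0.15, 0.80,
    0.30, 0.60, 0.10),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc_a", "doc_b", "doc_c"), c("t_1", "t_2", "t_3"))
)
labels <- matrix(c("cancer", "imaging", "genetics"), ncol = 1,
                 dimnames = list(c("t_1", "t_2", "t_3"), "label_1"))

# column index of the most prevalent topic in each row (document)
top_idx <- apply(theta, 1, which.max)

doc_labels <- data.frame(
  document   = rownames(theta),
  top_topic  = colnames(theta)[top_idx],
  # prevalence of that top topic, picked out with (row, column) index pairs
  prevalence = theta[cbind(seq_len(nrow(theta)), top_idx)],
  stringsAsFactors = FALSE
)

# look the label up by topic name; this keeps the row order of theta
doc_labels$label <- unname(labels[doc_labels$top_topic, 1])

print(doc_labels)
```

Keeping the prevalence next to the label also shows how dominant the top topic actually is in each document.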
    
    

    This throws away some of the information LDA gives you, though. One advantage of LDA is that each document can contain more than one topic; another is that you can see how much of each topic is in a document. You can see both here with a bar plot of a document's topic distribution:

    # set the plot margins to see the labels on the bottom
    par(mar = c(8.1,4.1,4.1,2.1))
    
    # barplot the first document's topic distribution with labels
    barplot(model$theta[1,], names.arg = model$labels, las = 2)
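
If you would rather have the multi-topic view as a table than as a plot, one option is to list every topic whose share of a document exceeds some cutoff. The sketch below uses base R only and assumes toy data: `theta` and `labels` are hypothetical stand-ins for `model$theta` and `model$labels`, and the `threshold` value is an arbitrary choice for illustration, not a recommendation.

```r
# Toy stand-ins (hypothetical values) for model$theta and model$labels
theta <- matrix(
  c(0.70, 0.20, 0.10,
    0.05, 0.15, 0.80),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc_a", "doc_b"), c("t_1", "t_2", "t_3"))
)
labels <- c(t_1 = "cancer", t_2 = "imaging", t_3 = "genetics")

threshold <- 0.12  # hypothetical cutoff; tune for your own model

# for each document, keep the topics at or above the cutoff
topics_per_doc <- lapply(rownames(theta), function(d) {
  p <- theta[d, ]
  keep <- p >= threshold
  data.frame(
    document   = rep(d, sum(keep)),
    topic      = names(p)[keep],
    label      = unname(labels[names(p)[keep]]),
    prevalence = unname(p[keep]),
    stringsAsFactors = FALSE
  )
})

# stack into one long table, most prevalent topic first within each document
topics_long <- do.call(rbind, topics_per_doc)
topics_long <- topics_long[order(topics_long$document, -topics_long$prevalence), ]

print(topics_long)
```

Each document then appears once per retained topic, so a document that mixes several topics keeps all of its labels.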