r quanteda topicmodels

How to keep the text id of removed text in lda


I have a dataframe like this

dtext <- data.frame(id = c(1,2,3,4), text = c("here","This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.", "The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.", "There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."),stringsAsFactors = F)

I clean the text for LDA like this:

library(quanteda)
library(topicmodels)
library(tidyverse)
toks <- tokens(dtext$text)
toks <- tokens_remove(toks, c(
  stopwords("en"),
  stringi::stri_replace_all_fixed(stopwords("en"), "'", "")
))
toks <- toks %>% tokens_wordstem()
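# form bigrams and trigrams on the tokens; newer quanteda releases deprecate passing ngrams to dfm()
myDfm <- toks %>%
    tokens_ngrams(n = 2:3) %>%
    dfm() %>%
    dfm_trim(min_termfreq = 0.75, termfreq_type = "quantile")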
dtm <- convert(myDfm, to = "topicmodels")
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

However, I noticed that when a document in the text column ends up containing nothing, it is removed from dtm.

gammaDF <- as.data.frame(lda@gamma) 
toptopics <- as.data.frame(cbind(document = row.names(gammaDF), 
                                 topic = apply(gammaDF,1,function(x) names(gammaDF)[which(x==max(x))])))

This causes a problem when I try to match each topic to the corresponding id in the first data frame. What can I do to get the right result, like this?

id, topic
2    1
3    2
4    1

Solution

  • The problem here is that LDA() removes the rownames from your document-term matrix and replaces them with a simple serial number. This no longer corresponds to your original dtext$id. But you can replace the LDA id with the document name, and then link this back to your input text.

    To make this clearer, we first replace your dtext$id with something that is easier to distinguish from the serial number that LDA() returns.

    # to distinguish your id from those from LDA()
    dtext$id <- paste0("doc_", dtext$id)
    
    # this takes the document name from "id"
    toks <- corpus(dtext, docid_field = "id") %>%
      tokens()
    

    Then run your remaining steps (from tokens_remove() onward) exactly as above.

    We can see that the first document is empty (has zero feature counts). This is the one that is dropped when the dfm is converted to the "topicmodels" format.

    ntoken(myDfm)
    ## doc_1 doc_2 doc_3 doc_4 
    ##     0    49    63   201

    as.matrix(dtm[, 1:3])
    ##        Terms
    ## Docs    dataset_contain contain_movi movi_review
    ##   doc_2               1            1           1
    ##   doc_3               1            0           0
    ##   doc_4               0            0           0


    However, these document names are lost in the LDA() output.

    toptopics
    ##   document topic
    ## 1        1    V2
    ## 2        2    V2
    ## 3        3    V1


    But we can (re)assign them from the rownames of dtm, which correspond 1:1, in the same order, to the documents returned by LDA().

    toptopics$docname <- rownames(dtm)
    toptopics
    ##   document topic docname
    ## 1        1    V2   doc_2
    ## 2        2    V2   doc_3
    ## 3        3    V1   doc_4
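
The reassignment works because row order is preserved even though the names are not. Here is a minimal base R sketch of that idea, which does not need quanteda or topicmodels. The `gamma` values and the `kept_docs` vector are made-up stand-ins for `lda@gamma` and `rownames(dtm)`. `max.col()` is swapped in for the `apply()` call from the question. That is an assumption about what you want: it returns exactly one topic per document, even if two topics tie.

```r
# Hypothetical stand-in for lda@gamma: 3 documents x 2 topics, no row names
gamma <- matrix(c(0.2, 0.8,
                  0.3, 0.7,
                  0.9, 0.1), ncol = 2, byrow = TRUE)

# Hypothetical stand-in for rownames(dtm): the documents that survived conversion
kept_docs <- c("doc_2", "doc_3", "doc_4")

gammaDF <- as.data.frame(gamma)      # columns become V1, V2
serial_ids <- row.names(gammaDF)     # serial numbers, not document ids

# Rows are still in the same order as the DTM, so the names can be reattached
rownames(gammaDF) <- kept_docs

toptopics <- data.frame(
  docname = rownames(gammaDF),
  # index of the largest value in each row; the first one wins on ties
  topic = names(gammaDF)[max.col(gammaDF, ties.method = "first")],
  stringsAsFactors = FALSE
)
toptopics
```

It may also be worth checking whether the fitted model already stores the names (for example in a `documents` slot), which would avoid relying on row order at all.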
    

    And now, toptopics$docname can be merged with dtext$id, solving your problem.
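
A minimal sketch of that merge in base R, assuming the ids were prefixed with "doc_" as above. The two data frames are small stand-ins with the topic values copied from the printed output, so the snippet runs on its own. `all.x = TRUE` is one possible choice: it keeps the empty document in the result with an NA topic instead of dropping it silently.

```r
# Stand-in for the input data frame (only the id column matters here)
dtext <- data.frame(id = paste0("doc_", 1:4), stringsAsFactors = FALSE)

# Stand-in for toptopics after docname has been assigned from rownames(dtm)
toptopics <- data.frame(
  document = c("1", "2", "3"),
  topic    = c("V2", "V2", "V1"),
  docname  = c("doc_2", "doc_3", "doc_4"),
  stringsAsFactors = FALSE
)

# Left join on the document name; doc_1 was dropped before LDA(), so its topic is NA
result <- merge(dtext, toptopics[, c("docname", "topic")],
                by.x = "id", by.y = "docname", all.x = TRUE)
result
```

Drop `all.x = TRUE` if you only want the documents that were actually passed to LDA().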