Tags: r, tm, topic-modeling, synonym, text2vec

Text preprocessing and topic modelling using the text2vec package


I have a large number of documents and I want to do topic modelling using text2vec and LDA (Gibbs Sampling).

The steps I need are as follows (in order):

  1. Removing numbers and symbols from the text

    library(stringr)
    #remove digits and symbols, then collapse runs of whitespace
    docs$text <- stringr::str_replace_all(docs$text, "[^[:alpha:]]", " ")
    docs$text <- stringr::str_replace_all(docs$text, "\\s+", " ")
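For illustration, here is the effect of these two substitutions on a small made-up string (the sample text is an assumption, not part of the original post):

    #quick check on a hypothetical string
    x <- "Order #42: 3 lattes!"
    x <- stringr::str_replace_all(x, "[^[:alpha:]]", " ")
    x <- stringr::str_replace_all(x, "\\s+", " ")
    x
    #[1] "Order lattes "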
    
  2. Removing stop words

    library(text2vec)        
    library(tm)
    
    stopwords <- c(tm::stopwords("english"), custom_stopwords)
    
    prep_fun <- tolower
    tok_fun <- word_tokenizer
    tokens <- docs$text %>% 
             prep_fun %>% 
             tok_fun
    it <- itoken(tokens, 
                ids = docs$id,
                progressbar = FALSE)
    
    v <- create_vocabulary(it, stopwords = stopwords) %>% 
        prune_vocabulary(term_count_min = 10)
    
    vectorizer <- vocab_vectorizer(v)
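As a quick sanity check at this point (illustrative only; the object returned by create_vocabulary() behaves like a data.frame with term, term_count, and doc_count columns), the pruned vocabulary can be inspected directly:

    #most frequent terms that survived stopword removal and pruning
    head(v[order(-v$term_count), ], 10)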
    
  3. Replacing synonyms with their main terms

I have an Excel file in which the first column holds the main word and its synonyms are listed in the second, third, and following columns. I want to replace all synonyms with their main words (column #1). Each term can have a different number of synonyms. Here is an example of code using the "tm" package (but I am interested in one using the text2vec package):

replaceSynonyms <- content_transformer(function(x, syn = NULL) {
  #for each synonym group, replace any listed synonym with its main word
  Reduce(function(a, b) {
    gsub(paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b"),
         b$word, a, perl = TRUE)
  }, syn, x)
})

#one list per main word: list(word = <main word>, syns = <its synonyms>)
l <- lapply(as.data.frame(t(Synonyms), stringsAsFactors = FALSE),
            function(x) {
              x <- unname(x)
              list(word = x[1], syns = x[-1])
            })
names(l) <- paste0("list", Synonyms[, 1])

synonyms <- unname(l)

MyCorpus <- tm_map(MyCorpus, replaceSynonyms, synonyms)
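For reference, reading such an Excel sheet into a long mainword/synonym table could look roughly like this; this is only a sketch, with the readxl package and the file name "synonyms.xlsx" as assumptions, not part of the original post:

library(readxl)
library(reshape2)

#sketch only: "synonyms.xlsx" is a hypothetical file name;
#first column = main word, remaining columns = its synonyms
syn_raw <- as.data.frame(readxl::read_excel("synonyms.xlsx"))
names(syn_raw)[1] <- "mainword"
syn_long <- reshape2::melt(syn_raw,
                           id.vars = "mainword",
                           value.name = "synonym",
                           na.rm = TRUE)[, c("mainword", "synonym")]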
  4. Converting to a document-term matrix

    dtm  <- create_dtm(it, vectorizer)
    
  5. Applying the LDA model to the document-term matrix

    doc_topic_prior <- 0.1  # can be chosen based on data? 
    lda_model <- LDA$new(n_topics = 10, 
              doc_topic_prior = doc_topic_prior, topic_word_prior = 0.01)
    doc_topic_distr <- lda_model$fit_transform(dtm, n_iter = 1000, 
              convergence_tol = 0.01, check_convergence_every_n = 10)
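Once fitted, the result can be inspected; a minimal illustration (parameter values here are arbitrary; get_top_words() is text2vec's accessor for the most probable words per topic):

    #each row of doc_topic_distr is one document's topic distribution
    dim(doc_topic_distr)
    #top 10 words per topic; lambda = 1 ranks by plain word probability
    lda_model$get_top_words(n = 10, topic_number = 1:10, lambda = 1)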
    

MyCorpus in Step 3 is the corpus obtained using the "tm" package. Steps 2 and 3 do not work together, as the output of Step 2 is a vocabulary while the input for Step 3 is a "tm" corpus.

My first question here is: how can I do all of these steps using the text2vec package (and compatible packages)? I found it very efficient; thanks to Dmitriy Selivanov.

Second: how do we set optimal values for the LDA parameters in Step 5? Is it possible to set them automatically based on the data?

Thanks to Manuel Bickel for corrections in my post.

Thanks, Sam


Solution

  • Updated answer in response to your comment:

    First question: The issue of synonym replacement has already been answered here: Replace words in text2vec efficiently. Check the answer by count in particular. Patterns and replacements may be ngrams (multi-word phrases). Please note that the second answer by Dmitriy Selivanov uses word_tokenizer() and does not cover the case of ngram replacement in the form presented.

    Is there any reason why you need to replace synonyms before stopword removal? Usually this order should not cause problems; or do you have an example in which switching the order produces significantly different results? If you really want to replace synonyms after stopword removal, I guess you would have to apply such changes to the dtm when using text2vec. If you do so, you need to allow ngrams in your dtm with a maximum ngram length at least as large as the longest phrase among your synonyms. I have provided a workaround in the code below as one option. Please note that allowing higher ngrams in your dtm produces noise that may or may not influence your downstream tasks (you can probably prune most of the noise in the vocabulary step). Therefore, replacing ngrams earlier seems to be the better solution.

    Second question: You might check the textmineR package (and its source code), which helps you select the best number of topics, or the answer to this question: Topic models: cross validation with loglikelihood or perplexity. A rough sketch of perplexity-based selection is also appended after the code below. Regarding the handling of priors, I have not yet figured out how different packages, e.g., text2vec (WarpLDA algorithm), lda (Collapsed Gibbs Sampling algorithm, among others), or topicmodels ('standard' Gibbs Sampling and Variational Expectation-Maximization algorithms), handle these values in detail. As a starting point, you might have a look at the detailed documentation of topicmodels: chapter "2.2. Estimation" tells you how the alpha and beta parameters defined in "2.1 Model specification" are estimated.

    For the purpose of learning, please note that your code produced errors at two points, which I have revised: (1) you need to use the correct name for stopwords in create_vocabulary(), i.e., stopwords instead of stop_words, since you defined the name as such; (2) you do not need vocabulary =... in your LDA model definition - maybe you are using an older version of text2vec?

    library(text2vec) 
    library(reshape2)
    library(stringi)
    
    #function proposed by @count
    mgsub <- function(pattern, replacement, x) {
      if (length(pattern) != length(replacement)) {
        stop("Pattern not equal to Replacement")
      } 
      for (v in 1:length(pattern)) {
        x <- gsub(pattern[v], replacement[v], x, perl = TRUE)
      }
      return(x)
    }
    
    docs <- c("the coffee is warm",
              "the coffee is cold",
              "the coffee is hot",
              "the coffee is boiling like lava",
              "the coffee is frozen",
              "the coffee is perfect",
              "the coffee is warm almost hot"
    )
    
    synonyms <- data.frame(mainword = c("warm", "cold")
                           ,syn1 = c("hot", "frozen")
                           ,syn2 = c("boiling like lava", "")
                           ,stringsAsFactors = FALSE)
    
    synonyms[synonyms == ""] <- NA
    
    synonyms <- reshape2::melt(synonyms
                               ,id.vars = "mainword"
                               ,value.name = "synonym"
                               ,na.rm = TRUE)
    
    synonyms <- synonyms[, c("mainword", "synonym")]
    
    
    prep_fun <- tolower
    tok_fun <- word_tokenizer
    tokens <- docs %>% 
      #here is where you might replace synonyms directly in the docs
      #{ mgsub(synonyms[,"synonym"], synonyms[,"mainword"], . ) } %>%
      prep_fun %>% 
      tok_fun
    it <- itoken(tokens, 
                 progressbar = FALSE)
    
    v <- create_vocabulary(it,
                           sep_ngram = "_",
                           ngram = c(ngram_min = 1L
                                     #allow ngrams in the dtm up to the length of
                                     #the longest synonym phrase (spaces + 1 = words)
                                     ,ngram_max = max(stri_count_fixed(unlist(synonyms), " ")) + 1L
                                     )
    )
    
    vectorizer <- vocab_vectorizer(v)
    dtm <- create_dtm(it, vectorizer)
    
    #ngrams in dtm
    colnames(dtm)
    
    #ensure that ngrams in synonym replacement table have the same format as ngrams in dtm
    synonyms <- apply(synonyms, 2, function(x) gsub(" ", "_", x))
    
    colnames(dtm) <- mgsub(synonyms[,"synonym"], synonyms[,"mainword"], colnames(dtm))
    
    
    #only zeros/ones in dtm since none of the docs specified in my example
    #contains duplicate terms
    dim(dtm)
    #7 35
    max(dtm)
    #1
    
    #workaround to aggregate columns of the dtm that share a name
    #I think there is no `colsum` function that allows grouping,
    #therefore a workaround based on rowsum;
    #not elegant because you have to transpose twice,
    #convert to a dense matrix and reconvert to a sparse matrix
    dtm <- 
      Matrix::Matrix(
        t(
          rowsum(t(as.matrix(dtm)), group = colnames(dtm))
        )
        , sparse = T)
    
    
    #synonyms in columns replaced
    dim(dtm)
    #7 28
    max(dtm)
    #2
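    Returning to the second question: a rough sketch (my own, not part of the original answer) of choosing n_topics with text2vec alone is to fit models over a grid and compare perplexity via text2vec's perplexity() helper. In-sample perplexity is used here for brevity; held-out perplexity or cross-validation would be more rigorous.
    
    #sketch: grid search over n_topics using in-sample perplexity
    n_topics_grid <- c(5, 10, 15, 20)
    perp <- sapply(n_topics_grid, function(k) {
      lda <- LDA$new(n_topics = k,
                     doc_topic_prior = 0.1, topic_word_prior = 0.01)
      doc_topic <- lda$fit_transform(dtm, n_iter = 1000, convergence_tol = 0.01)
      perplexity(dtm,
                 topic_word_distribution = lda$topic_word_distribution,
                 doc_topic_distribution = doc_topic)
    })
    n_topics_grid[which.min(perp)]  #candidate with lowest perplexity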