
Convert DocumentTermMatrix to dgTMatrix


I'm trying to run the AssociatedPress dataset from the tm-package through text2vec's LDA implementation.

The problem I'm facing is a mismatch of data types: AssociatedPress is a tm::DocumentTermMatrix, which in turn is a subclass of slam::simple_triplet_matrix. text2vec, however, expects the input x of the LDA model's fit_transform(x = ...) method to be a Matrix::dgTMatrix.

My question thus is: is there a way to coerce DocumentTermMatrix to something accepted by text2vec?

Minimal (failing) example:

library('tm')
library('text2vec')

data("AssociatedPress", package="topicmodels")

dtm <- AssociatedPress[1:10, ]

lda_model = LDA$new(
  n_topics = 10,
  doc_topic_prior = 0.1,
  topic_word_prior = 0.01
)

doc_topic_distr =
  lda_model$fit_transform(
    x = dtm,
    n_iter = 1000,
    convergence_tol = 0.001,
    n_check_convergence = 25,
    progressbar = FALSE
  )

...which gives:

base::rowSums(x, na.rm = na.rm, dims = dims, ...) : 'x' must be an array of at least two dimensions
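
For context, the same error can be reproduced on a toy object without tm or topicmodels. The following is only a sketch, assuming the slam package is installed; the object `toy` is a made-up stand-in for the real document-term matrix. It shows that a simple_triplet_matrix is stored as a classed list rather than an array, which is what base::rowSums() complains about.

```r
library(slam)

# Hypothetical 2 x 3 stand-in for a DocumentTermMatrix (illustration only).
toy <- simple_triplet_matrix(
  i = c(1L, 2L, 2L),
  j = c(1L, 1L, 3L),
  v = c(2, 1, 4),
  nrow = 2, ncol = 3
)

# The object is a list with a class attribute, not an array.
is_list  <- is.list(toy)
is_array <- is.array(toy)

# base::rowSums() therefore refuses it; capture the message instead of stopping.
msg <- tryCatch(base::rowSums(toy), error = function(e) conditionMessage(e))

# slam ships its own row_sums() that understands the triplet layout.
sums <- unname(row_sums(toy))

print(is_list)
print(is_array)
print(msg)
print(sums)
```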


Solution

  • The answer is in the duplicate linked by @Dmitriy Selivanov, but it doesn't mention that sparseMatrix() comes from the Matrix package (a recommended package that ships with R, not part of base).

    Since I do not have topicmodels installed, I will use the crude dataset, which is included in the tm package. The principle is the same.

    library(tm)
    data("crude")
    
    dtm <- DocumentTermMatrix(
      crude,
      control = list(
        weighting = function(x) weightTfIdf(x, normalize = FALSE),
        stopwords = TRUE
      )
    )
    
    # build a Matrix sparse matrix (class dgCMatrix) from the triplet fields
    m <-  Matrix::sparseMatrix(i=dtm$i, 
                               j=dtm$j, 
                               x=dtm$v, 
                               dims=c(dtm$nrow, dtm$ncol),
                               dimnames = dtm$dimnames)
    str(m)
    Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
      ..@ i       : int [1:1890] 6 1 18 6 6 5 9 12 9 5 ...
      ..@ p       : int [1:1201] 0 1 2 3 4 5 6 8 9 11 ...
      ..@ Dim     : int [1:2] 20 1200
      ..@ Dimnames:List of 2
      .. ..$ Docs : chr [1:20] "127" "144" "191" "194" ...
      .. ..$ Terms: chr [1:1200] "\"(it)" "\"demand" "\"expansion" "\"for" ...
      ..@ x       : num [1:1890] 4.32 4.32 4.32 4.32 4.32 ...
      ..@ factors : list()
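
    If the triplet form (dgTMatrix) from the question title is specifically wanted, the column-compressed result can be coerced afterwards. Below is a minimal sketch, assuming the Matrix and slam packages are installed. The helper name `stm_to_sparse` and the `toy` matrix are hypothetical and not part of tm, slam, or text2vec; the `as.numeric()` call is a precaution in case the stored values are integers.

    ```r
    library(Matrix)

    # Hypothetical helper (illustration only): converts a
    # slam::simple_triplet_matrix, and hence a tm DocumentTermMatrix,
    # into a Matrix sparse matrix.
    stm_to_sparse <- function(stm, triplet = FALSE) {
      m <- Matrix::sparseMatrix(
        i = stm$i,
        j = stm$j,
        x = as.numeric(stm$v),
        dims = c(stm$nrow, stm$ncol),
        dimnames = stm$dimnames
      )
      # sparseMatrix() returns the column-compressed form by default;
      # coerce to triplet form only when asked.
      if (triplet) as(m, "TsparseMatrix") else m
    }

    # Toy 2 x 3 matrix standing in for a real document-term matrix.
    toy <- slam::simple_triplet_matrix(
      i = c(1L, 2L, 2L),
      j = c(1L, 1L, 3L),
      v = c(2, 1, 4),
      nrow = 2, ncol = 3,
      dimnames = list(Docs = c("d1", "d2"), Terms = c("a", "b", "c"))
    )

    m_c <- stm_to_sparse(toy)                  # column-compressed
    m_t <- stm_to_sparse(toy, triplet = TRUE)  # triplet form

    print(class(m_c))
    print(class(m_t))
    ```

    With the objects above, `stm_to_sparse(dtm)` would be a drop-in replacement for the explicit sparseMatrix() call.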
    

    The rest of your code then runs unchanged, with m as the input:

    library(text2vec)
    
    lda_model <- LDA$new(
      n_topics = 10,
      doc_topic_prior = 0.1,
      topic_word_prior = 0.01
    )
    
    doc_topic_distr <-
      lda_model$fit_transform(
        x = m,
        n_iter = 1000,
        convergence_tol = 0.001,
        n_check_convergence = 25,
        progressbar = FALSE
      )
    
    INFO [2018-04-15 10:40:00] iter 25 loglikelihood = -32949.882
    INFO [2018-04-15 10:40:00] iter 50 loglikelihood = -32901.801
    INFO [2018-04-15 10:40:00] iter 75 loglikelihood = -32922.208
    INFO [2018-04-15 10:40:00] early stopping at 75 iteration