Keep removed rows from a convert process

From this command

quant_stm <- convert(tDfm, to = "stm")

I receive the following warning:

Warning message:
In dfm2stm(x, docvars, omit_empty = TRUE) :
  Dropped empty document(s): g_32, m_21, g_32, [... truncated]

Is there any way to keep the document names listed in this warning in a data frame?

Solution

Not that I can figure out. Why? Because the data structure for stm's "documents" input does not have any way to record documents with no features.

Let's examine how it works. First, we create a dfm of three documents and four distinct features, where the third document consists only of the fourth feature (call it "d").

library("quanteda")
## Package version: 2.1.2

dfmat <- dfm(c(
  "a a c c",
  "b b c c",
  "d d d d"
))

If we remove that feature, the third document becomes empty. This is what is being dropped in your output above.

(x <- dfm_remove(dfmat, "d"))
## Document-feature matrix of: 3 documents, 3 features (55.6% sparse).
##        features
## docs    a c b
##   text1 2 2 0
##   text2 0 2 2
##   text3 0 0 0

In the quanteda internal function dfm2stm(), this is what is happening:

x <- x[, order(featnames(x))]
x <- as(x, "dgTMatrix")
structure(quanteda:::ijv.to.doc(x@i + 1, x@j + 1, x@x),
  names = rownames(x)[which(rowSums(x) > 0)]
)
## $text1
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    2
## 
## $text2
##      [,1] [,2]
## [1,]    2    3
## [2,]    2    2

Note that in this object, which is the "documents" part of an stm input, each document is stored as a two-row matrix: the first row holds indexes into the "vocab" element, and the second row holds the count for that vocab element (feature). Only vocab elements with a non-zero count are recorded, which is why text2 has no column whose first row is 1 (text2 has no "a" features).
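To make that layout concrete, here is a minimal base R sketch that rebuilds the same structure from a plain count matrix. The helper name to_stm_docs is my own and hypothetical; it is not part of quanteda or stm, and it only mimics what the internal code above produces.

```r
# Counts for the three documents after removing "d";
# columns are the alphabetically sorted vocab.
counts <- matrix(
  c(2, 0, 2,
    0, 2, 2,
    0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("text1", "text2", "text3"), c("a", "b", "c"))
)

# Hypothetical helper (not a quanteda or stm function): one two-row matrix
# per non-empty document, row 1 = vocab index, row 2 = count.
to_stm_docs <- function(m) {
  keep <- which(rowSums(m) > 0)           # empty documents are skipped here
  docs <- lapply(keep, function(i) {
    nz <- unname(which(m[i, ] > 0))       # vocab indexes with non-zero counts
    rbind(nz, unname(m[i, nz]), deparse.level = 0)
  })
  names(docs) <- rownames(m)[keep]
  docs
}

docs <- to_stm_docs(counts)
names(docs)
## [1] "text1" "text2"
```

There is simply no slot in this list for text3: a document with no non-zero counts would be a matrix with zero columns, so it is left out.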

So: the scheme itself has no way of recording things not found, and if nothing is found in a document it's omitted.
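That said, if the goal is just to know which documents were dropped, one workaround is to record the empty documents yourself before converting. This is a minimal sketch, assuming the dropped documents are exactly those with zero tokens left in the dfm (which is what the warning suggests). The object names are illustrative, and I build the dfm from tokens() so that it should also run on newer quanteda versions.

```r
library("quanteda")

dfmat <- dfm(tokens(c("a a c c", "b b c c", "d d d d")))
x <- dfm_remove(dfmat, "d")

# Documents with no remaining features; assumed to match what convert() drops
is_empty <- ntoken(x) == 0
dropped <- data.frame(doc_id = docnames(x)[is_empty], stringsAsFactors = FALSE)
dropped

# Optionally drop them explicitly, so the conversion has nothing left to omit
x_nonempty <- dfm_subset(x, ntoken(x) > 0)
```

Here dropped holds the name of the empty document (text3 in this toy example), and x_nonempty can be passed on to convert() or stm().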

Note that there is no real reason to use convert(x, to = "stm") since the stm() function can take a dfm directly. (searchK() however cannot, so you might need it for that.)