Tags: r, classification, tm, text2vec

How to train a lasso with both text and numeric variables?


Consider this modified classic example:

library(dplyr)
library(tibble)

# tibble() replaces the deprecated data_frame()
dtrain <- tibble(text = c("Chinese Beijing Chinese",
                          "Chinese Chinese Shanghai",
                          "France",
                          "Tokyo Japan Chinese"),
                 add_numeric = c(1, 1, 0, 1),
                 doc_id = 1:4,
                 class = c(1, 1, 1, 0))


> dtrain
# A tibble: 4 x 4
  text                     add_numeric doc_id class
  <chr>                          <dbl>  <int> <dbl>
1 Chinese Beijing Chinese            1      1     1
2 Chinese Chinese Shanghai           1      2     1
3 France                             0      3     1
4 Tokyo Japan Chinese                1      4     0

Here, I would like to use lasso to predict class. The variables of interest are text and add_numeric.

I know how to use text2vec or tm to predict class from text only: the packages transform the text into a sparse document-term matrix and feed that to the model.

However, here I want to use both the textual variable text and the numeric variable add_numeric. I do not know how to combine the two. Any ideas? Thanks!


Solution

  • I haven't checked how to do this with text2vec, but with quanteda it is quite easy: just use cbind. The advantage is that the result stays a sparse matrix. I haven't changed the dimnames, so the added column shows up as feat1.

    library(quanteda)
    
    dtm <- dfm(tokens(dtrain$text)) # tokenize, then create the document-feature matrix
    dtm_num <- cbind(dtm, dtrain$add_numeric) # append the numeric column to the sparse matrix
    dtm_num
    Document-feature matrix of: 4 documents, 7 features (60.7% sparse).
    4 x 7 sparse Matrix of class "dfm"
           features
    docs    chinese beijing shanghai france tokyo japan feat1
      text1       2       1        0      0     0     0     1
      text2       2       0        1      0     0     0     1
      text3       0       0        0      1     0     0     0
      text4       1       0        0      0     1     1     1
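
The answer above stops at building the combined matrix, so here is a rough sketch of the remaining step: fitting the lasso on it. It uses text2vec (the package mentioned in the question) instead of quanteda, and glmnet for the model, which is my assumption since the question does not name a modelling package. Two extra rows in the data are invented for illustration only: glmnet's binomial family refuses to fit when a class has a single observation, which is the case for class 0 in the original four rows. The value passed to `s` is arbitrary.

```r
library(text2vec)
library(Matrix)
library(glmnet)

# The four documents from the question plus two INVENTED rows (hypothetical),
# added only so that each class has at least two observations.
dtrain <- data.frame(
  text = c("Chinese Beijing Chinese",
           "Chinese Chinese Shanghai",
           "France",
           "Tokyo Japan Chinese",
           "Tokyo Kyoto Japan",   # invented row
           "Chinese Macao"),      # invented row
  add_numeric = c(1, 1, 0, 1, 0, 1),
  doc_id = 1:6,
  class = c(1, 1, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Text -> sparse document-term matrix
it <- itoken(dtrain$text,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = dtrain$doc_id,
             progressbar = FALSE)
vocab <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(vocab))

# Numeric predictor as a one-column sparse matrix, so cbind() stays sparse
num <- Matrix(dtrain$add_numeric, ncol = 1, sparse = TRUE,
              dimnames = list(NULL, "add_numeric"))
x <- cbind(dtm, num)

# alpha = 1 selects the lasso penalty; glmnet accepts sparse matrices directly
fit <- glmnet(x, dtrain$class, family = "binomial", alpha = 1)

# Coefficients at an arbitrary penalty value (s is lambda)
coef(fit, s = 0.05)
```

With this little data glmnet will warn about small class sizes, and the coefficients mean nothing. On a real data set, `cv.glmnet()` with the same `x`, `family` and `alpha` would be the usual way to choose lambda. Also note that glmnet standardizes the columns by default, so the term counts and the numeric variable are penalized on a comparable scale.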