
R. Quanteda package. How to filter the values present in the dfm_tfidf?


I have a dfm weighted with dfm_tfidf() and I want to filter out values that fall below a certain threshold.

Code:

library("quanteda")

dfmat2 <-
  matrix(c(1,1,2,1,0,0, 1,1,0,0,2,3),
         byrow = TRUE, nrow = 2,
         dimnames = list(docs = c("document1", "document2"),
                         features = c("this", "is", "a", "sample",
                                      "another", "example"))) %>%
  as.dfm()


# this works
dfmat2 %>% dfm_trim(min_termfreq = 3)

# this does not work
dfm_tfidf(dfmat2) %>% dfm_trim(min_termfreq = 1)
# "Warning message: In dfm_trim.dfm(., min_termfreq = 1) : dfm has been previously weighted"

Question: How can I filter out the values present in the dfm_tfidf?

Thank you


Solution

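  • Here's a function that does this directly on the sparse matrix: it keeps only the features whose largest value in any document is at least a given absolute minimum. Be careful, though: absolute tf-idf values don't mean much across different dfm objects, so a threshold chosen for one dfm may not carry over to another.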

    library("quanteda")
    ## Package version: 2.1.1
    
    dfmat2 <-
      matrix(c(1, 1, 2, 1, 0, 0, 1, 1, 0, 0, 2, 3),
        byrow = TRUE, nrow = 2,
        dimnames = list(
          docs = c("document1", "document2"),
          features = c(
            "this", "is", "a", "sample",
            "another", "example"
          )
        )
      ) %>%
      as.dfm()
    
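    # function to trim features based on absolute minimum threshold
    # operating directly on sparse matrix
    dfm_trimabs <- function(x, min) {
      # group the stored values (x@x) by the feature (column) each one
      # belongs to, then take the largest value per feature
      maxvals <- sapply(
        split(x@x, featnames(x)[as(x, "dgTMatrix")@j + 1]),
        max
      )
      # keep only the features whose maximum reaches the threshold
      dfm_keep(x, names(maxvals)[maxvals >= min])
    }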
    

    Now apply it to the example above, before and after:

    # before trimming
    dfm_tfidf(dfmat2)
    ## Document-feature matrix of: 2 documents, 6 features (33.3% sparse).
    ##            features
    ## docs        this is       a  sample another example
    ##   document1    0  0 0.60206 0.30103 0       0      
    ##   document2    0  0 0       0       0.60206 0.90309
    
    # after trimming
    dfm_tfidf(dfmat2) %>%
      dfm_trimabs(min = 0.5)
    ## Document-feature matrix of: 2 documents, 3 features (50.0% sparse).
    ##            features
    ## docs              a another example
    ##   document1 0.60206 0       0      
    ##   document2 0       0.60206 0.90309
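
To see where the weights above come from and what the threshold does, here is a minimal sketch in base R that recomputes them by hand and applies the same column-maximum filter. It assumes tf-idf is the raw count multiplied by log10(number of documents / document frequency), which is consistent with the output shown above. The names counts, docfreq, idf, tfidf and keep are illustrative only and are not part of quanteda.

```r
# raw counts from the example, as a plain base-R matrix
counts <- matrix(c(1, 1, 2, 1, 0, 0, 1, 1, 0, 0, 2, 3),
  byrow = TRUE, nrow = 2,
  dimnames = list(
    docs = c("document1", "document2"),
    features = c("this", "is", "a", "sample", "another", "example")
  )
)

# document frequency: number of documents in which each feature occurs
docfreq <- colSums(counts > 0)

# inverse document frequency, log base 10
# (assumption: this matches the weighting used in the output above)
idf <- log10(nrow(counts) / docfreq)

# tf-idf: multiply each column of counts by that feature's idf
tfidf <- sweep(counts, 2, idf, `*`)
round(tfidf, 5)

# keep features whose largest weight in any document is at least 0.5
keep <- apply(tfidf, 2, max) >= 0.5
tfidf[, keep, drop = FALSE]
```

"this" and "is" occur in both documents, so their idf is log10(2/2) = 0 and they drop out at any positive threshold. "sample" peaks at about 0.301 and is removed at min = 0.5, leaving "a", "another" and "example", as in the trimmed dfm above.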