
How to use quanteda on aggregated data?


Consider this example:

tibble(text = c('a grande latte with soy milk',
                'black coffee no room'),
       repetition = c(100, 2)) 
# A tibble: 2 x 2
  text                         repetition
  <chr>                             <dbl>
1 a grande latte with soy milk        100
2 black coffee no room                  2

The data means that the sentence "a grande latte with soy milk" appears 100 times in my dataset. Of course, it would be a waste of memory to store that redundancy, which is why I have the repetition variable.

Still, I would like the dfm from quanteda to reflect that count, because the sparseness of the dfm gives me some room to keep that information. That is, how can I still have 100 rows for the first text in the dfm? The following code does not take repetition into account:

library(tibble)
library(dplyr)     # provides the %>% pipe
library(quanteda)

tibble(text = c('a grande latte with soy milk',
                'black coffee no room'),
       repetition = c(100, 2)) %>% 
  corpus() %>% 
  tokens() %>% 
  dfm()
Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
2 x 10 sparse Matrix of class "dfm"
       features
docs    a grande latte with soy milk black coffee no room
  text1 1      1     1    1   1    1     0      0  0    0
  text2 0      0     0    0   0    0     1      1  1    1

Solution

  • Supposing your data.frame is called df1, you can use cbind to add a column to the dfm. That might not give you the required result, though, because repetition then becomes just another feature next to the words. The other two options below are probably better.

    cbind

    library(tibble)
    library(dplyr)     # provides the %>% pipe
    library(quanteda)

    df1 <- tibble(text = c('a grande latte with soy milk',
                           'black coffee no room'),
                  repetition = c(100, 2))
    
    my_dfm <- df1 %>%  
      corpus() %>% 
      tokens() %>% 
      dfm() %>% 
      cbind(repetition = df1$repetition) # add column to dfm with name repetition

    my_dfm # print the result
    
    Document-feature matrix of: 2 documents, 11 features (45.5% sparse).
    2 x 11 sparse Matrix of class "dfm"
           features
    docs    a grande latte with soy milk black coffee no room repetition
      text1 1      1     1    1   1    1     0      0  0    0        100
      text2 0      0     0    0   0    0     1      1  1    1          2
    

    docvars

    You can also add the data via the docvars function. The values are then stored with the dfm as document-level variables, a bit more hidden in the dfm-class slots (reachable with @).

    docvars(my_dfm, "repetition") <- df1$repetition
    docvars(my_dfm)
    
          repetition
    text1        100
    text2          2
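
Once stored, the count can be read back from the dfm wherever it is needed. The following is a minimal sketch, assuming a recent quanteda version in which corpus() accepts a docvars argument and document variables carry over from the corpus to the tokens and the dfm. The object names corp, dfm_with_counts, reps and weighted are only illustrative.

```r
library(quanteda)

txt <- c(text1 = "a grande latte with soy milk",
         text2 = "black coffee no room")

# Attach the counts as a document variable when building the corpus
corp <- corpus(txt, docvars = data.frame(repetition = c(100, 2)))

# Document variables are assumed to travel with the documents into the dfm
dfm_with_counts <- dfm(tokens(corp))

# Read the stored counts back from the dfm
reps <- docvars(dfm_with_counts, "repetition")

# Reuse them, e.g. to scale each document's row by its count
weighted <- dfm_with_counts * reps
```

Keeping the count as a docvar leaves the feature counts untouched, so the weighting can be applied only where an analysis actually needs it.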
    

    multiplication

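    Multiplying the dfm by the repetition vector scales each document's row by its count, because the vector has one element per document. Here my_dfm is the dfm built without the cbind step above: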

    my_dfm * df1$repetition
    
    Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
    2 x 10 sparse Matrix of class "dfm"
           features
    docs      a grande latte with soy milk black coffee no room
      text1 100    100   100  100 100  100     0      0  0    0
      text2   0      0     0    0   0    0     2      2  2    2
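
To see why the weighted counts carry the same feature totals as 100 physical rows would, the check below compares them with a fully expanded matrix. It is a sketch in base R only: the plain matrix m is a hypothetical stand-in for the dfm above, not a quanteda object.

```r
features <- c("a", "grande", "latte", "with", "soy", "milk",
              "black", "coffee", "no", "room")

# Plain matrix with the same counts as the 2 x 10 dfm (stand-in, not a dfm)
m <- matrix(c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(docs = c("text1", "text2"), features = features))

repetition <- c(100, 2)

# Weighted version: element i of the vector multiplies row i
weighted <- m * repetition

# Fully expanded version: row 1 repeated 100 times, row 2 repeated twice
expanded <- m[rep(seq_len(nrow(m)), times = repetition), , drop = FALSE]

# Both versions give the same total count per feature
same_totals <- isTRUE(all.equal(colSums(weighted), colSums(expanded)))
```

If same_totals is TRUE, the two-row weighted matrix is enough for anything that depends only on feature totals. Analyses that count documents rather than features would still need the expanded form or explicit weights.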