r, sparse-matrix, quanteda

How to sum the columns of a weighted dfm in quanteda?


Consider this funny example

library(tibble)
library(quanteda)
library(magrittr)  # provides %>%

mytib <- tibble(text = c('i can see clearly now',
                         'the rain is gone'),
                myweight = c(1.7, 0.005)) 
# A tibble: 2 x 2
  text                  myweight
  <chr>                    <dbl>
1 i can see clearly now    1.7  
2 the rain is gone         0.005

I know how to create a dfm weighted by the docvar myweight. I proceed as follows:

dftest <- mytib %>% 
  corpus() %>% 
  tokens() %>% 
  dfm()

dftest * mytib$myweight 

Document-feature matrix of: 2 documents, 9 features (50.0% sparse).
2 x 9 sparse Matrix of class "dfm"
       features
docs      i can see clearly now   the  rain    is  gone
  text1 1.7 1.7 1.7     1.7 1.7 0     0     0     0    
  text2 0   0   0       0   0   0.005 0.005 0.005 0.005

However, the issue is that I can use neither topfeatures nor colSums.

How can I sum the values in every column then?

> dftest*mytib$myweight %>% Matrix::colSums(.)
Error in base::colSums(x, na.rm = na.rm, dims = dims, ...) : 
  'x' must be an array of at least two dimensions

Thanks!


Solution

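  • Sometimes the %>% operator harms rather than helps. In R, %>% binds more tightly than *, so your expression is parsed as dftest * (mytib$myweight %>% Matrix::colSums(.)), which calls colSums on a plain numeric vector. Dropping the pipe works: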

    colSums(dftest * mytib$myweight)
    ##      i     can     see clearly     now     the    rain      is    gone 
    ##  1.700   1.700   1.700   1.700   1.700   0.005   0.005   0.005   0.005 
    

    Also consider using dfm_weight(x, weights = ...) if you have a vector of weights for each feature. The operation above recycles your per-document weights so that it works the way you want, but you should understand why: R stores matrices in column-major order, so a weight vector whose length equals the number of documents is recycled down each column and lines up with the rows.
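    The two points above (operator precedence and recycling) can be checked in base R without quanteda. The following is only a minimal sketch: the small dense matrix m stands in for the dfm, and the weights in w are illustrative values, not the ones from the question. quote() is used so the pipe expression is parsed but never evaluated, which means magrittr does not need to be loaded.

```r
# 1. Precedence: %>% binds tighter than *, so the multiplication is the
#    outermost call and the pipe only applies to the right-hand operand.
e <- quote(x * w %>% colSums())
top_call   <- as.character(e[[1]])        # "*"
inner_call <- as.character(e[[3]][[1]])   # "%>%"

# 2. Recycling: a plain matrix standing in for the dfm
#    (2 documents x 3 features; hypothetical values).
m <- matrix(1:6, nrow = 2,
            dimnames = list(docs = c("text1", "text2"),
                            features = c("a", "b", "c")))
w <- c(10, 100)  # one illustrative weight per document (row)

# m is stored column by column, so w is recycled down each column:
# element [i, j] ends up multiplied by w[i].
weighted <- m * w

# The same operation written explicitly, without relying on recycling.
explicit <- sweep(m, 1, w, `*`)

col_totals <- colSums(weighted)
print(col_totals)
```

    If the weight vector had one entry per feature instead of one per document, m * w would silently give the wrong result, whereas sweep(m, 2, w, `*`) makes the intended margin explicit.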