What's the correct way to extract tf-idf topfeatures by document?

Assume we have a tf-idf weighted dfm from a corpus of 10K rather small documents.

What's the quanteda way of extracting the top features, i.e., the maximum tf-idf values, by document? I do want the entire corpus to be the reference when computing tf-idf. Something along the lines of

topfeatures(some_dfm_tf_idf, n = 3, decreasing = TRUE, groups = "id")

returns an appropriate list. Yet it takes quite some time for something that is basically already sorted out at this point. Given that quanteda has performed so well in everything I have done so far, I suspect I might be doing something wrong here.

Maybe this is somewhat related to this discussion on GitHub (https://github.com/quanteda/quanteda/issues/1646) and the example workaround that @Astelix shows there.

Solution

topfeatures() is exactly the way to go. I'm not sure why you say that it "takes quite some time", or what your "id" docvar is, but the following is the correct and most efficient way to get a list of the top-scoring features by document in your dfm (regardless of the weighting).

The result is a named list where the names are the docnames, and each element is a named numeric vector where the element name is the feature label.

library("quanteda")
## Package version: 1.5.2

some_dfm_tf_idf <- dfm(data_corpus_irishbudget2010)[1:5, ] %>%
  dfm_tfidf()

topfeatures(some_dfm_tf_idf, n = 1, groups = docnames(some_dfm_tf_idf))
## $`Lenihan, Brian (FF)`
## details 
## 5.57116 
## 
## $`Bruton, Richard (FG)`
## confront 
##  5.59176 
## 
## $`Burton, Joan (LAB)`
## lenihan 
## 4.19382 
## 
## $`Morgan, Arthur (SF)`
##    sinn 
## 5.59176 
## 
## $`Cowen, Brian (FF)`
## dividend 
##  4.19382
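
If a flat table is easier to work with than the nested list, the structure described above (docnames as list names, feature labels as vector names) can be unpacked with base R alone. The following is only a sketch: `top` is a hand-built stand-in for the output of topfeatures(), using two of the values printed above, and the names `top_df`, `document`, `feature` and `tf_idf` are illustrative choices, not anything quanteda defines.

```r
# Hand-built stand-in for the list returned by topfeatures(..., groups = ...):
# names are docnames, each element is a named numeric vector of scores.
top <- list(
  "Lenihan, Brian (FF)"  = c(details = 5.57116),
  "Bruton, Richard (FG)" = c(confront = 5.59176)
)

# Flatten into one row per (document, feature) pair.
top_df <- data.frame(
  document = rep(names(top), lengths(top)),                  # repeat docname once per feature
  feature  = unlist(lapply(top, names), use.names = FALSE),  # feature labels
  tf_idf   = unlist(top, use.names = FALSE),                 # scores, in the same order
  stringsAsFactors = FALSE
)

print(top_df)
```

The same code works unchanged for n > 1, because `lengths(top)` repeats each docname once for every feature kept for that document.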