r, sparse-matrix, tf-idf, quanteda

quanteda: Remove empty documents to compute tf-idf but keep them in the final dfm


I am trying to compute tf-idf on a dataset that contains many empty documents. I want the empty documents to be excluded from the tf-idf computation, but the output should still be a dfm object with the original number of documents.

Here's an example:

library(quanteda)
library(magrittr)  # provides the %>% pipe used below
texts = c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")
a = texts %>%
    tokens(tolower=T, remove_punct=T) %>%
    dfm() %>%
    dfm_wordstem() %>%
    dfm_remove(stopwords("en")) %>%
    dfm_tfidf()
print(a, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
       features
docs    bonjour   hello    good
  text1 0       0       0      
  text2 0.90309 0       0      
  text3 0       0.90309 0      
  text4 0       0       0      
  text5 0       0       0.90309
  text6 0       0       0      
  text7 0       0       0      
  text8 0       0       0    

But the IDF is affected by the number of empty documents, which I do not want. Therefore, I compute tf-idf on the subset of non-empty documents only:

a2 = texts %>%
    tokens(tolower=T, remove_punct=T) %>%
    dfm() %>%
    dfm_subset(ntoken(.) > 0) %>%
    dfm_wordstem() %>%
    dfm_remove(stopwords("en")) %>%
    dfm_tfidf()
print(a2, max_ndoc=10)
Document-feature matrix of: 3 documents, 3 features (66.67% sparse) and 0 docvars.
       features
docs      bonjour     hello      good
  text2 0.4771213 0         0        
  text3 0         0.4771213 0        
  text5 0         0         0.4771213
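
The difference between the two outputs can be reproduced by hand. The sketch below assumes the default settings of dfm_tfidf() (raw term counts, inverse document frequency, log base 10), which is consistent with the values printed above; the variable names are only illustrative.

```r
# Every term in the example occurs in exactly one document
doc_freq <- 1

# IDF when all 8 documents (5 of them empty) are counted
idf_all <- log10(8 / doc_freq)

# IDF when only the 3 non-empty documents are counted
idf_nonempty <- log10(3 / doc_freq)

# Each term occurs once in its document, so tf = 1 and tf-idf equals the IDF
tfidf_all <- 1 * idf_all
tfidf_nonempty <- 1 * idf_nonempty

print(round(tfidf_all, 5))       # 0.90309
print(round(tfidf_nonempty, 7))  # 0.4771213
```

The five empty documents inflate the document count from 3 to 8, which is the only reason the weights in a and a2 differ.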

I now want a sparse matrix with the same shape as the first matrix (8 rows, one per original document), but containing the tf-idf values from a2 for the non-empty texts. I found this code on Stack Overflow: https://stackoverflow.com/a/65635722

add_rows_2 <- function(M,v) {
    oldind <- unique(M@i)
    ## new row indices
    newind <- oldind + as.integer(rowSums(outer(oldind,v,">=")))
    ## modify dimensions
    M@Dim <- M@Dim + c(length(v),0L)
    M@i <- newind[match(M@i,oldind)]
    M
}
empty_texts_idx = which(texts=="")
position_after_insertion = empty_texts_idx - 1:(length(empty_texts_idx))

a3 = add_rows_2(a2, position_after_insertion)
print(a3, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
         features
docs        bonjour     hello      good
  text2.1 0         0         0        
  text3.1 0.4771213 0         0        
  text5.1 0         0.4771213 0        
  NA.NA   0         0         0        
  NA.NA   0         0         0.4771213
  NA.NA   0         0         0        
  NA.NA   0         0         0        
  NA.NA   0         0         0        

The values are what I want: the empty texts have been inserted at the appropriate rows of the matrix (although the document names are now wrong).
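
To see why the rows land in the right place, the index arithmetic used by add_rows_2 can be traced in base R without any sparse matrix. This is only a sketch: oldind is written out by hand here (in the function it comes from M@i), and v stands for position_after_insertion.

```r
texts <- c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")

# 1-based positions of the empty documents in the original vector
empty_texts_idx <- which(texts == "")

# Number of non-empty rows that precede each empty document,
# i.e. where each empty row is inserted relative to the 3-row matrix
v <- empty_texts_idx - seq_along(empty_texts_idx)

# 0-based row indices of the non-empty documents in the reduced matrix
oldind <- 0:2

# Shift each old row down by the number of empty rows inserted at or before it
newind <- oldind + as.integer(rowSums(outer(oldind, v, ">=")))

print(v)            # 0 2 3 3 3
print(newind + 1L)  # 2 3 5
```

The shifted rows (2, 3 and 5 in 1-based terms) are exactly the positions of the non-empty texts in the original vector.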

Question 1: I was wondering if there is a more efficient way to do this directly with the quanteda package...

Question 2: ...or at least a way that would not change the structure of the dfm object, since a3 and a do not have the same docvars attribute.

print(a3@docvars)
  docname_ docid_ segid_
1    text2  text2      1
2    text3  text3      1
3    text5  text5      1

print(docnames(a3))
[1] "text2" "text3" "text5"

print(a@docvars)
  docname_ docid_ segid_
1    text1  text1      1
2    text2  text2      1
3    text3  text3      1
4    text4  text4      1
5    text5  text5      1
6    text6  text6      1
7    text7  text7      1
8    text8  text8      1

I was able to have a "correct" format for a3 by running the following lines of code

# necessary to print proper names in 'docs' column
new_names = paste0("text", seq_along(texts))
new_docvars = data.frame(docname_=as.factor(new_names), docid_=as.factor(new_names), segid_=rep(1,length(texts)))
a3@docvars = new_docvars

# The following line is necessary for cv.glmnet to run using a3 as covariates
docnames(a3) <- new_names
# seems equivalent to a3@Dimnames$docs <- new_names

print(a3, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
       features
docs      bonjour     hello      good
  text1 0         0         0        
  text2 0.4771213 0         0        
  text3 0         0.4771213 0        
  text4 0         0         0        
  text5 0         0         0.4771213
  text6 0         0         0        
  text7 0         0         0        
  text8 0         0         0

print(a3@docvars) # this is now as expected
  docname_ docid_ segid_
1    text1  text1      1
2    text2  text2      1
3    text3  text3      1
4    text4  text4      1
5    text5  text5      1
6    text6  text6      1
7    text7  text7      1
8    text8  text8      1
print(docnames(a3)) # this is now as expected
[1] "text1" "text2" "text3" "text4" "text5" "text6" "text7" "text8"

I need to change docnames(a3) because I want to use a3 as covariates for a model trained with cv.glmnet, and I get an error if I do not change the document names of a3. Again, is this the correct way to proceed with quanteda? Manually changing docvars does not feel like the proper way to do it, and I could not find anything about it online. Any insights would be appreciated.

Thanks!


Solution

  • I do not know if it is a good idea to remove empty documents before computing tf-idf, but it is easy to restore the removed documents by using drop_docid = FALSE in dfm_subset() and fill = TRUE in dfm_group(), because quanteda keeps track of them.

    require(quanteda)
    #> Loading required package: quanteda
    #> Package version: 3.2.1
    #> Unicode version: 13.0
    #> ICU version: 66.1
    #> Parallel computing: 10 of 10 threads used.
    #> See https://quanteda.io for tutorials and examples.
    txt <- c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")
    corp <- corpus(txt)
    dfmt <- dfm(tokens(corp))
    dfmt
    #> Document-feature matrix of: 8 documents, 8 features (87.50% sparse) and 0 docvars.
    #>        features
    #> docs    bonjour ! hello , how are you good
    #>   text1       0 0     0 0   0   0   0    0
    #>   text2       1 1     0 0   0   0   0    0
    #>   text3       0 0     1 1   1   1   1    0
    #>   text4       0 0     0 0   0   0   0    0
    #>   text5       0 0     0 0   0   0   0    1
    #>   text6       0 0     0 0   0   0   0    0
    #> [ reached max_ndoc ... 2 more documents ]
    
    
    dfmt2 <- dfm_subset(dfmt, ntoken(dfmt) > 0, drop_docid = FALSE) %>% 
      dfm_tfidf()
    dfmt2
    #> Document-feature matrix of: 3 documents, 8 features (66.67% sparse) and 0 docvars.
    #>        features
    #> docs      bonjour         !     hello         ,       how       are       you
    #>   text2 0.4771213 0.4771213 0         0         0         0         0        
    #>   text3 0         0         0.4771213 0.4771213 0.4771213 0.4771213 0.4771213
    #>   text5 0         0         0         0         0         0         0        
    #>        features
    #> docs         good
    #>   text2 0        
    #>   text3 0        
    #>   text5 0.4771213
    
    dfmt3 <- dfm_group(dfmt2, fill = TRUE, force = TRUE)
    dfmt3
    #> Document-feature matrix of: 8 documents, 8 features (87.50% sparse) and 0 docvars.
    #>        features
    #> docs      bonjour         !     hello         ,       how       are       you
    #>   text1 0         0         0         0         0         0         0        
    #>   text2 0.4771213 0.4771213 0         0         0         0         0        
    #>   text3 0         0         0.4771213 0.4771213 0.4771213 0.4771213 0.4771213
    #>   text4 0         0         0         0         0         0         0        
    #>   text5 0         0         0         0         0         0         0        
    #>   text6 0         0         0         0         0         0         0        
    #>        features
    #> docs         good
    #>   text1 0        
    #>   text2 0        
    #>   text3 0        
    #>   text4 0        
    #>   text5 0.4771213
    #>   text6 0        
    #> [ reached max_ndoc ... 2 more documents ]
    

    Created on 2022-06-16 by the reprex package (v2.0.1)
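
For completeness, the same approach can be combined with the preprocessing from the question. This is a minimal sketch, assuming quanteda 3.x as in the reprex above; the object names are illustrative, and the subsetting is done after stemming and stopword removal (a choice made here, not part of the original answer) so that documents that become empty during preprocessing are excluded as well.

```r
library(quanteda)

txt <- c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")

# Same preprocessing as in the question, written without the pipe
toks <- tokens(txt, remove_punct = TRUE)
dfmt <- dfm(toks)  # dfm() lowercases by default
dfmt <- dfm_wordstem(dfmt)
dfmt <- dfm_remove(dfmt, stopwords("en"))

# Drop documents that are empty after preprocessing, but keep their ids
dfmt_nonempty <- dfm_subset(dfmt, ntoken(dfmt) > 0, drop_docid = FALSE)

# tf-idf is computed on the remaining documents only
dfmt_tfidf <- dfm_tfidf(dfmt_nonempty)

# Restore the dropped documents as all-zero rows; force = TRUE is needed
# because the dfm is already weighted
dfmt_full <- dfm_group(dfmt_tfidf, fill = TRUE, force = TRUE)

print(ndoc(dfmt_full))
print(docnames(dfmt_full))
```

Because the document ids are carried through, the restored dfm should keep the original document names, so no manual editing of docvars or docnames is needed before passing it to a model.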