I am trying to compute tf-idf on a dataset with many empty documents. I want to compute tf-idf on the non-empty documents only, but still get as output a dfm object with the original number of documents.
Here's an example:
texts = c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")
a = texts %>%
  tokens(tolower=T, remove_punct=T) %>%
  dfm() %>%
  dfm_wordstem() %>%
  dfm_remove(stopwords("en")) %>%
  dfm_tfidf()
print(a, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
features
docs bonjour hello good
text1 0 0 0
text2 0.90309 0 0
text3 0 0.90309 0
text4 0 0 0
text5 0 0 0.90309
text6 0 0 0
text7 0 0 0
text8 0 0 0
But the IDF is affected by the number of empty documents, which I do not want. Therefore, I compute tf-idf on the subset of non-empty documents like so:
a2 = texts %>%
  tokens(tolower=T, remove_punct=T) %>%
  dfm() %>%
  dfm_subset(ntoken(.) > 0) %>%
  dfm_wordstem() %>%
  dfm_remove(stopwords("en")) %>%
  dfm_tfidf()
print(a2, max_ndoc=10)
Document-feature matrix of: 3 documents, 3 features (66.67% sparse) and 0 docvars.
features
docs bonjour hello good
text2 0.4771213 0 0
text3 0 0.4771213 0
text5 0 0 0.4771213
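The gap between 0.90309 and 0.4771213 comes from the IDF term alone. A minimal sketch in base R, assuming dfm_tfidf() is called with its defaults (raw counts for term frequency, log10 of the number of documents divided by the document frequency for IDF); the variable names are hypothetical:

```r
# Each of the three features occurs once, in exactly one document.
term_count <- 1
doc_freq <- 1

n_docs_all <- 8       # all documents, including the five empty ones
n_docs_nonempty <- 3  # non-empty documents only

# tf-idf = term count * log10(number of documents / document frequency)
tfidf_all <- term_count * log10(n_docs_all / doc_freq)
tfidf_nonempty <- term_count * log10(n_docs_nonempty / doc_freq)

print(round(tfidf_all, 5))       # 0.90309
print(round(tfidf_nonempty, 7))  # 0.4771213
```

Only the document count changes between the two runs, which is why every non-zero cell shifts by the same amount.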
I now want a sparse matrix with the same format as the first matrix (8 documents), but with the tf-idf values from a2 for the non-empty texts. I found this code on Stack Overflow: https://stackoverflow.com/a/65635722
add_rows_2 <- function(M,v) {
oldind <- unique(M@i)
## new row indices
newind <- oldind + as.integer(rowSums(outer(oldind,v,">=")))
## modify dimensions
M@Dim <- M@Dim + c(length(v),0L)
M@i <- newind[match(M@i,oldind)]
M
}
empty_texts_idx = which(texts=="")
position_after_insertion = empty_texts_idx - 1:(length(empty_texts_idx))
a3 = add_rows_2(a2, position_after_insertion)
print(a3, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
features
docs bonjour hello good
text2.1 0 0 0
text3.1 0.4771213 0 0
text5.1 0 0.4771213 0
NA.NA 0 0 0
NA.NA 0 0 0.4771213
NA.NA 0 0 0
NA.NA 0 0 0
NA.NA 0 0 0
This is what I want: the empty texts have been added at the appropriate rows in the matrix.
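For comparison, the same row re-insertion can be sketched without touching the S4 slots, by rebuilding the matrix from its (row, column, value) triplets with the Matrix package. This is only an illustration on a plain sparse matrix that mimics a2, not on a quanteda dfm; the names kept_rows, m_small and m_full are hypothetical.

```r
library(Matrix)

texts <- c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")
kept_rows <- which(texts != "")  # positions of the non-empty documents

# Stand-in for a2: 3 documents x 3 features, one non-zero value per row
m_small <- sparseMatrix(
  i = 1:3, j = 1:3, x = rep(log10(3), 3),
  dims = c(3, 3),
  dimnames = list(paste0("text", kept_rows), c("bonjour", "hello", "good"))
)

# Extract the non-zero entries as (i, j, x) triplets
trip <- summary(m_small)

# Map each old row index to its position in the full document set
m_full <- sparseMatrix(
  i = kept_rows[trip$i], j = trip$j, x = trip$x,
  dims = c(length(texts), ncol(m_small)),
  dimnames = list(paste0("text", seq_along(texts)), colnames(m_small))
)

print(m_full)
```

The result is a plain sparse matrix from the Matrix package, so dfm-specific attributes such as docvars are not carried over.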
Question 1: I was wondering if there is a more efficient way to do this directly with the quanteda package...
Question 2: ...or at least a way that would not change the structure of the dfm object, since a3 and a do not have the same docvars attribute.
print(a3@docvars)
docname_ docid_ segid_
1 text2 text2 1
2 text3 text3 1
3 text5 text5 1
print(docnames(a3))
[1] "text2" "text3" "text5"
print(a@docvars)
docname_ docid_ segid_
1 text1 text1 1
2 text2 text2 1
3 text3 text3 1
4 text4 text4 1
5 text5 text5 1
6 text6 text6 1
7 text7 text7 1
8 text8 text8 1
I was able to get a "correct" format for a3 by running the following lines of code:
# necessary to print proper names in 'docs' column
new_names = paste0("text", seq_along(texts))
new_docvars = data.frame(docname_=as.factor(new_names), docid_=as.factor(new_names), segid_=rep(1,length(texts)))
a3@docvars = new_docvars
# The following line is necessary for cv.glmnet to run using a3 as covariates
docnames(a3) <- new_names
# seems equivalent to a3@Dimnames$docs <- new_names
print(a3, max_ndoc=10)
Document-feature matrix of: 8 documents, 3 features (87.50% sparse) and 0 docvars.
features
docs bonjour hello good
text1 0 0 0
text2 0.4771213 0 0
text3 0 0.4771213 0
text4 0 0 0
text5 0 0 0.4771213
text6 0 0 0
text7 0 0 0
text8 0 0 0
print(a3@docvars) # this is now as expected
docname_ docid_ segid_
1 text1 text1 1
2 text2 text2 1
3 text3 text3 1
4 text4 text4 1
5 text5 text5 1
6 text6 text6 1
7 text7 text7 1
8 text8 text8 1
print(docnames(a3)) # this is now as expected
[1] "text1" "text2" "text3" "text4" "text5" "text6" "text7" "text8"
I need to change docnames(a3) because I want to use a3 as covariates for a model I want to train with cv.glmnet, but I get an error if I don't change the document names for a3. Again, is this the correct way to proceed with quanteda? Manually changing docvars does not feel like the proper way to do it, and I could not find anything online about it. Any insights would be appreciated.
Thanks!
I do not know whether it is a good idea to remove empty documents before computing tf-idf, but it is easy to restore the removed documents with drop_docid = FALSE and fill = TRUE, because quanteda keeps track of them.
require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")
corp <- corpus(txt)
dfmt <- dfm(tokens(corp))
dfmt
#> Document-feature matrix of: 8 documents, 8 features (87.50% sparse) and 0 docvars.
#> features
#> docs bonjour ! hello , how are you good
#> text1 0 0 0 0 0 0 0 0
#> text2 1 1 0 0 0 0 0 0
#> text3 0 0 1 1 1 1 1 0
#> text4 0 0 0 0 0 0 0 0
#> text5 0 0 0 0 0 0 0 1
#> text6 0 0 0 0 0 0 0 0
#> [ reached max_ndoc ... 2 more documents ]
dfmt2 <- dfm_subset(dfmt, ntoken(dfmt) > 0, drop_docid = FALSE) %>%
  dfm_tfidf()
dfmt2
#> Document-feature matrix of: 3 documents, 8 features (66.67% sparse) and 0 docvars.
#> features
#> docs bonjour ! hello , how are you
#> text2 0.4771213 0.4771213 0 0 0 0 0
#> text3 0 0 0.4771213 0.4771213 0.4771213 0.4771213 0.4771213
#> text5 0 0 0 0 0 0 0
#> features
#> docs good
#> text2 0
#> text3 0
#> text5 0.4771213
dfmt3 <- dfm_group(dfmt2, fill = TRUE, force = TRUE)
dfmt3
#> Document-feature matrix of: 8 documents, 8 features (87.50% sparse) and 0 docvars.
#> features
#> docs bonjour ! hello , how are you
#> text1 0 0 0 0 0 0 0
#> text2 0.4771213 0.4771213 0 0 0 0 0
#> text3 0 0 0.4771213 0.4771213 0.4771213 0.4771213 0.4771213
#> text4 0 0 0 0 0 0 0
#> text5 0 0 0 0 0 0 0
#> text6 0 0 0 0 0 0 0
#> features
#> docs good
#> text1 0
#> text2 0
#> text3 0
#> text4 0
#> text5 0.4771213
#> text6 0
#> [ reached max_ndoc ... 2 more documents ]
Created on 2022-06-16 by the reprex package (v2.0.1)
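As a rough sketch, the same three steps could be combined with the preprocessing from the question. The ordering is an assumption: the subset is taken after stemming and stopword removal, so that documents left empty by those steps are also excluded from the IDF. The object names are hypothetical, and the code relies on the behaviour shown above for quanteda 3.2.1.

```r
library(quanteda)

texts <- c("", "Bonjour!", "Hello, how are you", "", "Good", "", "", "")

# Preprocessing as in the question (dfm() lowercases by default)
toks <- tokens(texts, remove_punct = TRUE)
dfmat <- dfm(toks)
dfmat <- dfm_wordstem(dfmat)
dfmat <- dfm_remove(dfmat, stopwords("en"))

# Drop empty documents but keep their ids, weight, then restore them
nonempty <- dfm_subset(dfmat, ntoken(dfmat) > 0, drop_docid = FALSE)
weighted <- dfm_tfidf(nonempty)
restored <- dfm_group(weighted, fill = TRUE, force = TRUE)

ndoc(restored)
docnames(restored)
```

Since the restored object is still a dfm, its document names and docvars should not need to be patched by hand before it is passed on.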