I am computing cosine similarity between two dfm objects. One is my reference object, which has dimensions 5 x 4,728; the other is my target object, which has dimensions 2,325,329 x 40,595.
What I don't understand is why textstat_simil() returns NAs. I tried to reproduce the issue, but no luck so far. You can find the data at the following Dropbox links. Note that the shared target dfm contains only the first document.
This is the code I am using. dfm_match() pads my reference dfm so that its features match those of the target object.
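library(quanteda)
# In quanteda >= 3, textstat_simil() lives in the quanteda.textstats package.
# Load the two required dfms (reference_dfm and target_dfm) first.
# Align the reference features with those of the target.
reference_dfm <- dfm_match(reference_dfm, featnames(target_dfm))
textstat_simil(target_dfm, reference_dfm, method = "cosine")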
#> textstat_simil object; method = "cosine"
#> negative slightly_negative neutral slightly_positive positive
#> text1.1 NA NA NA NA NA
Any idea?
Your target_dfm is entirely sparse (all 0s), so cosine similarity is undefined for it: the document vector has zero norm, and the formula divides by that norm.
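# Convert to a data.frame; the first column holds the document id
target_df <- convert(target_dfm, "data.frame")
# Count the non-zero cells across all feature columns
sum(target_df[, 2:ncol(target_df)] > 0)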
#> 0
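To see why an all-zero row cannot produce a number, here is a minimal base R sketch of the cosine formula. The cosine() helper and the toy vectors are made up for illustration; this is not quanteda's internal implementation, just the textbook definition.

```r
# Hypothetical helper (not part of quanteda): textbook cosine similarity
cosine <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

empty_doc <- c(0, 0, 0)  # a document with no non-zero features
ref_doc   <- c(1, 2, 0)  # a toy reference vector

# Numerator and denominator are both 0, so R evaluates 0 / 0 to NaN
cosine(empty_doc, ref_doc)

# With one non-zero value the norm is positive and the result is defined;
# here it is 0 because the two vectors share no features
cosine(c(0, 0, 1), ref_doc)
```

The undefined 0 / 0 case is what shows up as NA in the textstat_simil() output.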
You can also see this when you print the dfm to the console: it reports "100.0% sparse". Here is a dfm with a single non-zero value, and the calculation works.
# A one-token document; tokens() is used because dfm() on a corpus is
# deprecated in quanteda >= 3
test_dfm <- dfm(tokens("adds"))
test_dfm <- dfm_match(test_dfm, featnames(target_dfm))
textstat_simil(test_dfm, reference_dfm, method = "cosine")
#> textstat_simil object; method = "cosine"
#> negative slightly_negative neutral slightly_positive positive
#> text1.1 0 0 0 0 0
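If the full target dfm contains other empty documents, one option is to drop them before computing similarities. The following is only a sketch on toy data: toy_target and non_empty are made-up names, and it assumes that filtering on ntoken() is acceptable for your analysis.

```r
library(quanteda)

# Toy stand-in for the real target dfm; doc2 is deliberately empty
toy_target <- dfm(tokens(c(doc1 = "good movie", doc2 = "")))

# ntoken() gives the total feature count per document (0 for empty rows)
ntoken(toy_target)

# Keep only the documents with at least one non-zero count
non_empty <- dfm_subset(toy_target, ntoken(toy_target) > 0)

ndoc(non_empty)      # 1
docnames(non_empty)  # "doc1"
```

The filtered dfm can then be passed to textstat_simil() in place of the original target, so every remaining row has a well-defined cosine similarity.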