r, cosine-similarity, quanteda

Why does textstat_simil() with method "cosine" return NA?


I am computing cosine similarity between two dfm objects. The first is my reference object, with dimensions 5 x 4,728; the second is my target object, with dimensions 2,325,329 x 40,595.

What I don't understand is why textstat_simil() returns NAs. I tried to reproduce the issue on a smaller example, but no luck so far. You can find the data at the following Dropbox links. Note that the uploaded target dfm contains only the first document.

  1. Reference dfm
  2. Target dfm

This is the code I am using. dfm_match() augments my reference dfm to match the number of features of the target object.

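library(quanteda)
# in quanteda >= 3.0, textstat_simil() lives in the quanteda.textstats package
library(quanteda.textstats)

# make sure you load the two required dfms

# pad the reference dfm so it has the same features as the target dfm
reference_dfm = dfm_match(reference_dfm, featnames(target_dfm))
textstat_simil(target_dfm, reference_dfm, method = "cosine")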

#> textstat_simil object; method = "cosine"
#>         negative slightly_negative neutral slightly_positive positive
#> text1.1       NA                NA      NA                NA       NA

Any idea?


Solution

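  • Your target_dfm is 100% sparse: every cell is 0. Cosine similarity divides the dot product of two vectors by the product of their norms, and an all-zero document has norm 0, so the result is 0/0, which is undefined and reported as NA.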

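    # convert to a data.frame; the first column is doc_id, the rest are feature counts
    target_df <- convert(target_dfm, "data.frame")
    # count the non-zero cells across all feature columns
    sum(target_df[, 2:ncol(target_df)] > 0)
    #> [1] 0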
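To see why an all-zero row cannot produce a similarity score, here is a minimal base R sketch of the cosine formula. The `cosine()` helper and the three vectors are made up for illustration and are not part of quanteda; quanteda's own implementation is different, but the arithmetic is the same.

```r
# Cosine similarity between two numeric vectors (base R, no packages).
# Hypothetical helper for illustration only.
cosine <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

ref   <- c(1, 0, 2)  # made-up reference row
empty <- c(0, 0, 0)  # an all-zero document row
one   <- c(0, 0, 1)  # a row with a single non-zero count

cosine(empty, ref)   # 0 / 0, which is undefined (NaN in base R)
cosine(one, ref)     # a well-defined number
```

The zero vector has no direction, so there is no angle to measure against the reference rows.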
    

    You can also see this when you print the dfm to the console: it is reported as "100.0% sparse". Here is a dfm with a single non-zero value, and with it the calculation works.

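    # a one-word document; tokenize first, since dfm() on a corpus is deprecated in quanteda >= 3.0
    test_dfm <- dfm(tokens("adds"))
    test_dfm <- dfm_match(test_dfm, featnames(target_dfm))
    # reference_dfm is the reference dfm already matched to the target features
    textstat_simil(test_dfm, reference_dfm, method = "cosine")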
    #> textstat_simil object; method = "cosine"
    #>         negative slightly_negative neutral slightly_positive positive
    #> text1.1        0                 0       0                 0        0
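If the full target dfm mixes empty and non-empty documents, one option is to drop the empty rows before computing similarities. The sketch below uses made-up toy data (`ref`, `tgt`) in place of the real dfms and assumes quanteda >= 3.0 with quanteda.textstats installed.

```r
library(quanteda)
library(quanteda.textstats)  # assumption: quanteda >= 3.0

# Hypothetical toy data standing in for the real reference and target dfms
ref <- dfm(tokens(c(negative = "bad awful", positive = "good great")))
tgt <- dfm(tokens(c(text1 = "", text2 = "good bad film")))

# Pad/trim the reference so its features match the target, as in the question
ref <- dfm_match(ref, featnames(tgt))

# Keep only target documents that contain at least one token
tgt_nonempty <- dfm_subset(tgt, ntoken(tgt) > 0)

textstat_simil(tgt_nonempty, ref, method = "cosine")
```

The same `ntoken()` check can be used simply to count how many target documents are empty before deciding whether to drop them or keep the NA results.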