r text cosine-similarity quanteda sentence-similarity

Interpretation question: textstat_simil() in quanteda

I have a dataset of 310,225 tweets and want to find out how many of them are the same or similar. I calculated the similarity between the tweets using quanteda's textstat_simil(). The frequencies of the similarity values 1 and 0.9999 in the resulting similarity matrix are as follows:

0.9999      1
  2288 162743

Here's my code:

library("quanteda")

# dfm_data: a dfm built from the tweets (its construction is not shown)
dfmat_users <- dfm_data %>%
  dfm_select(min_nchar = 2) %>%
  dfm_trim(min_termfreq = 10)

# keep only the tweets with more than 10 tokens
dfmat_users <- dfmat_users[ntoken(dfmat_users) > 10, ]

tstat_sim <- textstat_simil(dfmat_users, method = "cosine", margin = "documents", min_simil = 0.9998)

table(tstat_sim@x)  # the result of this code is given above

I need to find out the number of identical or similar tweets in the dataset. How should I interpret the results above?

Solution

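The easiest way is to convert the textstat_simil() output to a data.frame of unique pairs, and then keep the pairs whose cosine value is above your threshold (here, 0.9999). Each value in the similarity matrix describes a pair of tweets, so the table in your question counts matrix entries (pairs), not tweets.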

To illustrate, we can reshape the built-in inaugural address corpus into sentences and compute the similarity matrix on those. We then coerce the result to a data.frame and use dplyr to filter the pairs you want.

library("quanteda", warn.conflicts = FALSE)
## Package version: 2.1.0.9000
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.

sim_df <- data_corpus_inaugural %>%
  corpus_reshape(to = "sentences") %>%
  dfm() %>%
  textstat_simil(method = "cosine") %>%
  as.data.frame()

nrow(sim_df)
## [1] 12508670

For your data, change the threshold in the condition below to 0.9999; here I'm using 0.99 as an illustration.

library("dplyr", warn.conflicts = FALSE)
filter(sim_df, cosine > .99)
##            document1       document2 cosine
## 1    1861-Lincoln.69 1861-Lincoln.71      1
## 2    1861-Lincoln.69 1861-Lincoln.73      1
## 3    1861-Lincoln.71 1861-Lincoln.73      1
## 4  1953-Eisenhower.6   1985-Reagan.6      1
## 5  1953-Eisenhower.6    1989-Bush.15      1
## 6      1985-Reagan.6    1989-Bush.15      1
## 7      1989-Bush.140  2009-Obama.108      1
## 8      1989-Bush.140   2013-Obama.87      1
## 9     2009-Obama.108   2013-Obama.87      1
## 10     1989-Bush.140    2017-Trump.9      1
## 11    2009-Obama.108    2017-Trump.9      1
## 12     2013-Obama.87    2017-Trump.9      1
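
Since the question asks for the number of tweets rather than the number of pairs, one possible follow-up is to count the distinct documents that appear in at least one qualifying pair. The following is a minimal base R sketch. The `similar_pairs` data.frame is a made-up stand-in that only mimics the `document1`, `document2`, and `cosine` columns shown above; the names `n_pairs` and `n_docs` are likewise just illustrative.

```r
# Toy stand-in for the filtered pairs; document names and values are invented.
similar_pairs <- data.frame(
  document1 = c("tweet1", "tweet1", "tweet2", "tweet7"),
  document2 = c("tweet2", "tweet5", "tweet5", "tweet9"),
  cosine    = c(1, 1, 1, 0.9999),
  stringsAsFactors = FALSE
)

# Number of qualifying pairs
n_pairs <- nrow(similar_pairs)

# Distinct documents that occur in at least one qualifying pair.
# as.character() guards against the columns being factors.
docs_in_pairs <- unique(c(as.character(similar_pairs$document1),
                          as.character(similar_pairs$document2)))
n_docs <- length(docs_in_pairs)

n_pairs
n_docs
```

The two numbers differ because a group of k mutually similar documents produces k * (k - 1) / 2 pairs. The three Lincoln sentences above, for example, account for three rows of the output.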

(And: yeah, that's a very fast computation of cosine similarity between 12.5 million sentence pairs!)
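
One caveat when reading a cosine of 1 as "the same tweet": cosine similarity compares the direction of the feature count vectors, so it ignores word order and overall length. The sketch below uses a hand-written `cosine_sim()` helper (not a quanteda function) and hypothetical count vectors to show this. By the same logic, tweets that differ only in features dropped by dfm_select() or dfm_trim() in your pipeline would presumably also score 1.

```r
# Cosine similarity between two numeric count vectors (illustrative helper)
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical counts over the vocabulary c("good", "morning", "all", "night")
doc_a <- c(1, 1, 1, 0)  # "good morning all"
doc_b <- c(1, 1, 1, 0)  # "all good morning": same words, different order
doc_c <- c(2, 2, 2, 0)  # the same three words, each used twice
doc_d <- c(1, 0, 1, 1)  # "good night all": one word differs

sim_ab <- cosine_sim(doc_a, doc_b)  # identical counts
sim_ac <- cosine_sim(doc_a, doc_c)  # proportional counts
sim_ad <- cosine_sim(doc_a, doc_d)  # partly overlapping counts

c(sim_ab, sim_ac, sim_ad)
```

The first two similarities come out as 1 (up to floating-point error) even though the texts are not literally identical, while the third is clearly lower. If you need exact duplicates, comparing the raw tweet text directly is a safer check.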