I have a dataset of 310,225 tweets and want to find out how many of them are identical or similar. I calculated the similarity between the tweets using quanteda's textstat_simil(). The frequencies of the values 0.9999 and 1 in the similarity matrix are as follows:
0.9999      1
  2288 162743
Here's my code:
dfmat_users <- dfm_data %>%
  dfm_select(min_nchar = 2) %>%
  dfm_trim(min_termfreq = 10)
dfmat_users <- dfmat_users[ntoken(dfmat_users) > 10, ]
tstat_sim <- textstat_simil(dfmat_users, method = "cosine", margin = "documents", min_simil = 0.9998)
table(tstat_sim@x)  # the result of this call is shown above
I need to find the number of identical or similar tweets in the dataset. How should I interpret the results above?
The easiest way is to convert the textstat_simil() output to a data.frame of unique pairs, and then filter the ones whose cosine value is above your threshold (here, .9999).
To illustrate, we can reshape the built-in inaugural address corpus into sentences, compute the similarity matrix on them, coerce the result to a data.frame, and then use dplyr to filter the pairs you want.
library("quanteda", warn.conflicts = FALSE)
## Package version: 2.1.0.9000
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
sim_df <- data_corpus_inaugural %>%
  corpus_reshape(to = "sentences") %>%
  dfm() %>%
  textstat_simil(method = "cosine") %>%
  as.data.frame()
nrow(sim_df)
## [1] 12508670
You can adjust the condition below for your data to 0.9999 - here I'm using 0.99 as an illustration.
library("dplyr", warn.conflicts = FALSE)
filter(sim_df, cosine > .99)
##            document1       document2 cosine
## 1    1861-Lincoln.69 1861-Lincoln.71      1
## 2    1861-Lincoln.69 1861-Lincoln.73      1
## 3    1861-Lincoln.71 1861-Lincoln.73      1
## 4  1953-Eisenhower.6   1985-Reagan.6      1
## 5  1953-Eisenhower.6    1989-Bush.15      1
## 6      1985-Reagan.6    1989-Bush.15      1
## 7      1989-Bush.140  2009-Obama.108      1
## 8      1989-Bush.140   2013-Obama.87      1
## 9     2009-Obama.108   2013-Obama.87      1
## 10     1989-Bush.140    2017-Trump.9      1
## 11    2009-Obama.108    2017-Trump.9      1
## 12     2013-Obama.87    2017-Trump.9      1
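Note that each row in this output is a pair of documents, not a document, so the number of rows is not the number of similar sentences (or tweets). Here is a minimal base R sketch of counting the distinct documents that appear in at least one pair. It assumes a data.frame with the document1/document2/cosine columns shown above; the `pairs_df` object is a hand-typed copy of those twelve rows for illustration, not something computed by quanteda.

```r
# Toy copy of the twelve pairs printed above (hand-typed, for illustration only)
pairs_df <- data.frame(
  document1 = c("1861-Lincoln.69", "1861-Lincoln.69", "1861-Lincoln.71",
                "1953-Eisenhower.6", "1953-Eisenhower.6", "1985-Reagan.6",
                "1989-Bush.140", "1989-Bush.140", "2009-Obama.108",
                "1989-Bush.140", "2009-Obama.108", "2013-Obama.87"),
  document2 = c("1861-Lincoln.71", "1861-Lincoln.73", "1861-Lincoln.73",
                "1985-Reagan.6", "1989-Bush.15", "1989-Bush.15",
                "2009-Obama.108", "2013-Obama.87", "2013-Obama.87",
                "2017-Trump.9", "2017-Trump.9", "2017-Trump.9"),
  cosine = 1,
  stringsAsFactors = FALSE
)

# as.character() guards against factor columns, which c() can turn into
# integer codes in older versions of R
docs_in_pairs <- unique(c(as.character(pairs_df$document1),
                          as.character(pairs_df$document2)))

n_pairs <- nrow(pairs_df)         # number of similar PAIRS: 12
n_docs  <- length(docs_in_pairs)  # distinct documents in any pair: 10
```

On this toy copy, 12 pairs involve only 10 distinct sentences. A group of k mutually identical documents produces choose(k, 2) pairs, so the 162,743 pairs at cosine 1 in the question do not translate directly into a count of tweets.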
(And: yeah, that's a very fast computation of cosine similarity between 12.5 million sentence pairs!)
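If what you need is the number of groups of duplicates rather than the number of pairs, one option is to link every pair and count the connected components. The following is only a sketch in base R: `cluster_pairs()` is a hypothetical helper (not part of quanteda), and the ids `t1` to `t5` are made-up tweet names.

```r
# Hypothetical helper: group documents that are linked by at least one pair,
# using a simple union-find over the pair list.
cluster_pairs <- function(doc1, doc2) {
  doc1 <- as.character(doc1)
  doc2 <- as.character(doc2)
  docs <- unique(c(doc1, doc2))
  idx1 <- match(doc1, docs)
  idx2 <- match(doc2, docs)
  parent <- seq_along(docs)  # every document starts as its own root

  find_root <- function(i) {
    while (parent[[i]] != i) i <- parent[[i]]
    i
  }

  for (k in seq_along(idx1)) {
    a <- find_root(idx1[k])
    b <- find_root(idx2[k])
    if (a != b) parent[[max(a, b)]] <- min(a, b)  # merge the two groups
  }

  roots <- vapply(seq_along(docs), find_root, integer(1))
  split(docs, roots)  # one list element per group
}

# Made-up pairs: t1, t2, t3 are mutually similar; t4 and t5 are similar
toy_pairs <- data.frame(
  document1 = c("t1", "t1", "t2", "t4"),
  document2 = c("t2", "t3", "t3", "t5"),
  stringsAsFactors = FALSE
)

clusters <- cluster_pairs(toy_pairs$document1, toy_pairs$document2)
length(clusters)   # number of groups
lengths(clusters)  # number of documents in each group
```

Two caveats: with a threshold below 1, linking pairs transitively can chain together documents that are only indirectly similar, and this plain R loop may be slow for very many pairs, in which case a graph library such as igraph is probably a better fit.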