Building text-similarity time series in a corpus of tweets

I want to measure the evolution of text similarity over time. My data frame consists on a column for tweet identifiers ( id) a column for dates with a daily frequency (date) and a column with the tidy text of the tweets (clean_text).

Here is a brief repex with some actual cleaned tweets:

final <- data.frame(id=1:5,Date=c(as.Date("2020-12-26"),as.Date("2020-12-26"),as.Date("2020-12-27"),
                    as.Date("2020-12-27"),as.Date("2020-12-27")),
              clean_text = c("americans died covid nfebruary people couple days",
                                        "cops crush peoples necks death eric garner",  
                                        "video clip tells george floyd resist arrest earlier claimed police officer", 
                                        "black americans terrible daily dangers outdoor spaces subjected unwarranted suspicion", 
                                         "announcement waiting minneapolis police officer derek chauvin charged manslaug"))

Since I want something that permits me to measure how similar/disimilar text became with time, I thought about using some similarity measures (i.e frequency, cosine similarity...), via textstat_simil from quanteda.

Here is my try:

require(quanteda)            
start <- as.Date("2020-12-26",format="%Y-%m-%d")
    end   <- as.Date("2020-12-27",format="%Y-%m-%d")
    
    theDate <- start
    i=1
    
    similarity<-data.frame(matrix(NA, nrow = as.numeric(end+1-start), ncol = 1))
    #value<-vector(mode="numeric")
    colnames(similarity)<-c("value")
    while (theDate <= end){
      if (nrow(subset(final,final$Date==theDate))>1){
        corp <- corpus(subset(final,final$Date==theDate),
                       docid_field = "id",
                       text_field = "clean_text")
        a<-as.matrix(dfm(corp) %>%
                       textstat_simil())
        
            similarity$value[i]<-sum(a[lower.tri(a)])/length(corp)
      }else{
        similarity$value[i]<-0
      } 
      i<-i+1
      theDate <- theDate + 1 
      
    }

In this code, I am summing over the lower triangular elements of the correlation matrix a and dividing by the number of tweets to get a "mean" of the correlation over the tweets. I believe this is the problematic step, since, when I apply this function to my corpus of tweets the similarity drops precisely in the moments where tweets are very related.

Also I must say that my goal is to obtain a time series with a column for a measure of similarity and another for dates so it can be plotted. Maybe there is a simpler npl method to track this?

edit: Just realised that the correlation entries in matrix a are negative. Not sure how to interpret that or whether there is an error in the coding then?

Solution

How about this:

library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")

dfmat <- final %>%
  corpus(text_field = "clean_text") %>%
  tokens() %>%
  dfm()

Now we will create a series of dfm objects, by date:

date_splits <- split(seq_len(ndoc(dfmat)), dfmat$Date)
date_splits
## $`2020-12-26`
## [1] 1 2
## 
## $`2020-12-27`
## [1] 3 4 5

dfmat_split <- lapply(date_splits, function(x) dfmat[x, ])
dfmat_split
## $`2020-12-26`
## Document-feature matrix of: 2 documents, 41 features (82.93% sparse) and 2 docvars.
##        features
## docs    americans died covid nfebruary people couple days cops crush peoples
##   text1         1    1     1         1      1      1    1    0     0       0
##   text2         0    0     0         0      0      0    0    1     1       1
## [ reached max_nfeat ... 31 more features ]
## 
## $`2020-12-27`
## Document-feature matrix of: 3 documents, 41 features (75.61% sparse) and 2 docvars.
##        features
## docs    americans died covid nfebruary people couple days cops crush peoples
##   text3         0    0     0         0      0      0    0    0     0       0
##   text4         1    0     0         0      0      0    0    0     0       0
##   text5         0    0     0         0      0      0    0    0     0       0
## [ reached max_nfeat ... 31 more features ]

Then, create a list of data.frame results, by date, by coercing each textstat_simil() matrix output into a data.frame, and then join these up in an output data.frame where we add back the date columns. This gives us a data.frame of (unique) pairwise similarities, by date:

simil_by_date <- lapply(dfmat_split, function(x) {
  textstat_simil(x) %>%
    as.data.frame()
})
simil_by_date
## $`2020-12-26`
##   document1 document2 correlation
## 1     text1     text2  -0.2058824
## 
## $`2020-12-27`
##   document1 document2 correlation
## 1     text3     text4 -0.34391797
## 2     text3     text5 -0.05514368
## 3     text4     text5 -0.30120725

df <- data.frame(
  date = as.Date(rep(names(simil_by_date), sapply(simil_by_date, nrow))),
  do.call(rbind, simil_by_date),
  row.names = NULL
)
df
##         date document1 document2 correlation
## 1 2020-12-26     text1     text2 -0.20588235
## 2 2020-12-27     text3     text4 -0.34391797
## 3 2020-12-27     text3     text5 -0.05514368
## 4 2020-12-27     text4     text5 -0.30120725

Now it's simple to aggregate the similarities using your function of choice. Here, I've used dplyr to do this by mean correlation:

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df %>%
  group_by(date) %>%
  summarise(simil = mean(correlation))
## # A tibble: 2 x 2
##   date        simil
##   <date>      <dbl>
## 1 2020-12-26 -0.206
## 2 2020-12-27 -0.233