Tags: r, nlp, bigdata, quanteda

Computing cosine similarities on a large corpus in R using quanteda


I am working with a very large corpus of about 85,000 tweets that I'm trying to compare to dialog from television commercials. However, due to the size of my corpus, I cannot compute the cosine similarity measure without getting the "Error: cannot allocate vector of size n" message (26 GB in my case).

I am already running 64-bit R on a server with lots of memory. I've also tried the AWS instance with the most memory (244 GB), but to no avail (same error).

Is there a way to use a package like fread to get around this memory limitation, or do I just have to find a way to break up my data? Thanks much for the help; I've appended the code below:

```r
library(quanteda)

x <- NULL
y <- NULL
num <- NULL
z <- NULL
ad <- NULL
for (i in 1:nrow(ad.corp$documents)) {
  num <- i
  ad <- paste("ad.num", num, sep = "_")
  # subset the corpus to the current ad
  x <- subset(ad.corp, ad.corp$documents$num == num)
  z <- x + corp.all
  z$documents$texts <- as.character(z$documents$texts)
  PolAdsDfm <- dfm(z, ignoredFeatures = stopwords("english"), groups = "num",
                   stem = TRUE, verbose = TRUE, removeTwitter = TRUE)
  PolAdsDfm <- tfidf(PolAdsDfm)
  y <- similarity(PolAdsDfm, ad, margin = "documents", n = 20,
                  method = "cosine", normalize = TRUE)
  y <- sort(y, decreasing = TRUE)
  if (y[1] > .7) {
    assign(paste(ad, x$documents$texts, sep = "--"), y)
  } else {
    print(paste(ad, "didn't make the cut", sep = "****"))
  }
}
```

Solution

  • The error was most likely caused by previous versions of quanteda (before 0.9.1-8, on GitHub as of 2016-01-01), which coerced dfm objects into dense matrices in order to call proxy::simil(). The newer version now works on sparse dfm objects without coercion for method = "correlation" and method = "cosine". (More sparse methods are coming soon.)
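    To see why the dense coercion exhausts memory, here is a rough illustration using the Matrix package (the dimensions are made up for the sketch; a real document-feature matrix is typically more than 99% zeros):

```r
library(Matrix)

# A sparse 10,000 x 5,000 matrix with 0.1% nonzero entries,
# standing in for a (much larger) document-feature matrix.
m <- rsparsematrix(10000, 5000, density = 0.001)

print(object.size(m))  # well under 1 MB: only the nonzeros are stored
# as.matrix(m) would allocate 10000 * 5000 * 8 bytes, roughly 400 MB;
# the same dense coercion at 85,000 documents is what produces the
# "cannot allocate vector of size ..." error.
```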

    I can't really follow what you are doing in the code, but it looks like you are computing pairwise similarities between documents aggregated into groups. I would suggest the following workflow:

    1. Create your dfm with the groups option for all groups of texts you want to compare.

    2. Weight this dfm with tfidf() as you have done.

    3. Use y <- textstat_simil(PolAdsDfm, margin = "documents", method = "cosine") and then coerce this to a full, symmetric matrix using as.matrix(y). All of your pairwise documents are then in that matrix, and you can select on the condition of being greater than your threshold of 0.7 directly from that object.

      Note that there is no need to normalise term frequencies with method = "cosine". In newer versions of quanteda, the normalize argument has been removed anyway, since I think it's a better workflow practice to weight the dfm prior to any computation of similarities, rather than building weightings into textstat_simil().

    Final note: I strongly suggest not accessing the internals of a corpus object the way you do here, since those internals may change and break your code. Use texts(z) instead of z$documents$texts, for instance, and docvars(ad.corp, "num") instead of ad.corp$documents$num.
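    Putting steps 1–3 together, a minimal sketch (assuming a quanteda version that provides textstat_simil(); tfidf() was later renamed dfm_tfidf(), and dfm() argument names such as remove vary across versions — ad.corp and the "num" docvar are from the question):

```r
library(quanteda)

# 1. One dfm for all groups, aggregated by the "num" docvar
PolAdsDfm <- dfm(ad.corp, groups = "num",
                 remove = stopwords("english"), stem = TRUE)

# 2. tf-idf weighting before any similarity computation
PolAdsDfm <- tfidf(PolAdsDfm)   # dfm_tfidf() in later versions

# 3. All pairwise cosine similarities, computed on the sparse dfm
y <- textstat_simil(PolAdsDfm, margin = "documents", method = "cosine")

# Coerce to a full symmetric matrix and filter on the 0.7 threshold
simMat <- as.matrix(y)
diag(simMat) <- NA                      # drop self-similarities
hits <- which(simMat > 0.7, arr.ind = TRUE)
```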