I am generating a matrix using the lsa package in R. After the matrix is created, I would like to calculate the cosine similarity between specific pairs of documents (columns) in the matrix.
Currently, I am doing this with nested for-loops, and it is monstrously slow. In the code below, there are 150 sourceIDs and 6413 targetIDs, for a total of 961.950 comparisons. After an hour and a half on my number-crunching machine, it has only gotten through ~300k of them.
For more info, sourceIDs and targetIDs are vectors of column names, loaded in from two files containing those names. I want to apply the cosine function between all of the source->target pairs. The columns are indexed by the document name, which is a string.
I am sure there is a much faster way of doing this with apply, but I just cant wrap my head around it.
library(lsa)
# tf function
real_tf <- function(m)
{
return (sweep(m, MARGIN=2, apply(m, 2, max), "/"))
}
#idf function
real_idf <- function(m)
{
df = rowSums(lw_bintf(m), na.rm=TRUE)
return (log(ncol(m)/df))
}
#load corpus
lsa.documents <- textmatrix(args[1], minWordLength=1, minDocFreq=0)
# compute tf-idf
lsa.weighted_documents <- real_tf(lsa.documents) * real_idf(lsa.documents)
# compute svd
lsa.nspace <- lsa(lsa.weighted_documents, dims = as.integer(args[5]))
lsa.matrix <- diag(lsa.nspace$sk) %*% t(lsa.nspace$dk)
# compute similarities
lsa.sourceIDs <- scan(args[2], what = character())
lsa.targetIDs <- scan(args[3], what = character())
lsa.similarities <- data.frame(SourceID=character(), TargetID=character(), Score=numeric(), stringsAsFactors=FALSE)
k <- 1
for (i in lsa.sourceIDs)
{
for (j in lsa.targetIDs)
{
lsa.similarities[k,] <- c(i, j, cosine(lsa.matrix[,i], lsa.matrix[,j]))
k <- k + 1
}
}
lsa.ranklist <- lsa.similarities[order(lsa.similarities$Score, decreasing=TRUE),]
# save ranklist
write.table(lsa.ranklist, args[4], sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE)
Edit: Reproducible example
# cosine function from lsa package
cosine <- function( x, y )
{
return ( crossprod(x,y) / sqrt( crossprod(x)*crossprod(y) ) )
}
theMatrix <- structure(c(-0.0264639232505822, -0.0141165039351167, -0.0280459775632757,
-0.041211247161448, -0.00331565717239375, -0.0291161345945683,
-0.0451167802746869, -0.0116214407383401, -0.0381080747718958,
-1.36693644389599, 0.274747343110076, 0.128100677705483, -0.401760905661056,
-1.24876927957167, 0.368479552862631, -0.459711112157286, -0.544344448332346,
-0.765378939625159, -1.28612431910459, 0.293455499695499, 0.025167452173962
), .Dim = c(3L, 7L), .Dimnames = list(NULL, c("doc1", "doc2", "doc3",
"doc4", "doc5", "doc6", "doc7")))
sources <- c("doc1", "doc2", "doc3")
targets <- c("doc4", "doc5", "doc6", "doc7")
similarities <- data.frame(SourceID=character(), TargetID=character(), Score=numeric(), stringsAsFactors=FALSE)
k <- 1
for (i in sources)
{
for (j in targets)
{
similarities[k,] <- c(i, j, cosine(theMatrix[,i], theMatrix[,j]))
k <- k + 1
}
}
ranklist <- similarities[order(similarities$Score, decreasing=TRUE),]
write.table(ranklist, "C:\\Temp\\outputfile.txt", sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE)
Which produces (outputfile.txt):
doc1 doc6 0.962195242094352
doc3 doc6 0.893461576046585
doc2 doc6 0.813856201398669
doc2 doc7 0.768837903803964
doc2 doc4 0.730093288388069
doc3 doc7 0.675640649189972
doc3 doc4 0.635982900340315
doc1 doc7 0.53871688669971
doc1 doc4 0.499235059782688
doc1 doc5 0.320383772495164
doc3 doc5 0.226751624753921
doc2 doc5 0.144680489733846
Ok, thanks for the reproducible example. Here is a possible solution. Let's first split your theMatrix
into source and target matrices. We do not need to use the names here, as we will not use loops:
matrix1 <- theMatrix[,1:3]
matrix2 <- theMatrix[,4:7]
Then we will create a function to loop through every column of matrix2, keeping a single column from matrix1 constant:
cycleM2 <- function(x) {
# x is a vector from matrix1
apply(matrix2,2,cosine,x)
}
Finally, we will supply this function to every column of matrix1:
(mydata <- apply(matrix1,2,cycleM2))
# doc1 doc2 doc3
# doc4 0.4992351 0.7300933 0.6359829
# doc5 0.3203838 0.1446805 0.2267516
# doc6 0.9621952 0.8138562 0.8934616
# doc7 0.5387169 0.7688379 0.6756406
Finally, if you really need your original data format:
require(reshape2)
melt(mydata)
This should speed up your code nicely. Also, as @flodel has noticed, when you use loops, pre-allocate your (empty) target object in memory, filling it e.g. with NA. Memory allocations are the most costly in terms of time, and that is why your original loop was so slow.
EDIT:
A better form using pure function would probably be:
pairwiseCosine <- function(matrix1,matrix2) {
apply(matrix1,2,function(x){
apply(matrix2,2,cosine,x)
})
}
pairwiseCosine(theMatrix[,1:3],theMatrix[,4:7])