How to efficiently retrieve top K-similar vectors by cosine similarity using R?

I'm working on a high-dimensional problem (~4k terms) and would like to retrieve top k-similar (by cosine similarity) and can't afford to do a pair-wise calculation.

My training set is 6million x 4k matrix and I would like to make predictions for 600k x 4k matrix.

What is the most efficient way to retrieve the k-similar items for each item in my 600k x 4k matrix?

Ideally, I would like to get a matrix which is 600k x 10 (i.e., top 10-similar items for each of the 600k items).

ps: I've researched the SO website and found almost all "cosine similarity in R" questions refer to cosine_sim(vector1, vector2). But this question refers to cosine_sim(matrix1, matrix2).

Update The following code uses a naive method to find the cosine similarity between each row in the testset and every row in the training set.

set.seed(123)
train<-matrix(round(runif(30),0),nrow=6,ncol=5)
set.seed(987)
test<-matrix(round(runif(20),0),nrow=4,ncol=5)
train

[1,]    0    1    1    0    1    
[2,]    1    1    1    1    1    
[3,]    0    1    0    1    1    
[4,]    1    0    1    1    1    
[5,]    1    1    0    1    0    
[6,]    0    0    0    1    0

test

[1,]    0    1    1    0    0
[2,]    1    0    1    0    1
[3,]    1    0    0    0    0
[4,]    1    0    0    1    1

coSim<-function(mat1, mat2, topK){
require(plyr)
#mat2: is the testset
#mat1: is the training set. We will find cosine similarity between each row in testset and every row in trainingset.
#topK: user-input. for each row in testset we will return 'topk' similar rows(index) from the testset

#set up an empty result matrix. nrow(result) will be the same as the cartesian product between mat1 & mat2.
result<-matrix(rep(NA, nrow(mat1)*nrow(mat2)), nrow=nrow(mat1)*nrow(mat2), ncol=3)
k=1
for(i in 1:nrow(mat2)){
    for(j in 1:nrow(mat1)){
    result[k,1]<-i
    result[k,2]<-j
    result[k,3]<-crossprod(mat1[j,], mat2[i,])/sqrt(crossprod(mat1[j,]) * crossprod(mat2[i,]))
    k<-k+1
        }
    }
#sort the result matrix by cosine similarity found for each row in testset. not sure how to keep topK from each group so convert to df
result<-as.data.frame(result)
colnames(result)<-c("testRowId", "trainRowId","CosineSimilarity")
result<-ddply(result, "testRowId", function(x) head(x[order(x$CosineSimilarity, decreasing = TRUE) , ], topK))
resultMat<-matrix(result$trainRowId, nrow=nrow(mat2), ncol=topK,byrow=T)
finalResult<-list(similarity=result, index=resultMat)
}

system.time(cosineSim<-coSim(train, test, topK=2)) #0.12 secs
cosineSim
$similarity
  testRowId trainRowId CosineSimilarity
1         1          1        0.8164966
2         1          2        0.6324555
3         2          4        0.8660254
4         2          2        0.7745967
5         3          5        0.5773503
6         3          4        0.5000000
7         4          4        0.8660254
8         4          2        0.7745967

$index
     [,1] [,2]
[1,]    1    2
[2,]    4    2
[3,]    5    4
[4,]    4    2


set.seed(123)
train<-matrix(round(runif(1000000),0),nrow=5000,ncol=200)
set.seed(987)
test<-matrix(round(runif(400000),0),nrow=2000,ncol=200)
system.time(cosineSim<-coSim(train, test, topK=50)) #380secs

When I run the same function with 5000x200 matrix for training and 2000x200 matrix for testing, it took over 380secs.

Ideally, I would like to see some ideas where I do not have to calculate the similarity between each and every row. If that is not possible, some pointers on how to vectorise the above code will be helpful.

Solution

No need to compute the similarity for every row. You can use this instead:

coSim2<-function(mat1, mat2, topK){
    #similarity computation:

    xy <- tcrossprod(mat1, mat2)
    xx <- rowSums(mat1^2)
    yy <- rowSums(mat2^2)
    result <- xy/sqrt(outer(xx,yy))

    #top similar rows from train (per row in test):

    top <- apply(result, 2, order, decreasing=TRUE)[1:topK,]
    result_df <- data.frame(testRowId=c(col(top)), trainRowId=c(top))
    result_df$CosineSimilarity <- result[as.matrix(result_df[,2:1])]
    list(similarity=result_df, index=t(top))
}

Test data (I've reduced your train matrix)

set.seed(123)
train<-matrix(round(runif(100000),0),nrow=500,ncol=200)
set.seed(987)
test<-matrix(round(runif(400000),0),nrow=2000,ncol=200)

Result:

> system.time(cosineSim<-coSim(train, test, topK=50)) #380secs
   user  system elapsed 
  41.71    1.59   43.72 

> system.time(cosineSim2<-coSim2(train, test, topK=50)) #380secs
   user  system elapsed 
   0.46    0.02    0.49

Using your full 5000 x 200 train matrix, coSim2 runs in 7.8 sec.

Also note:

> any(cosineSim$similarity != cosineSim2$similarity)
[1] FALSE
> any(cosineSim$index != cosineSim2$index)
[1] FALSE

You can't use identical because my function returns integers instead of doubles for row IDs.