I am using rdist iteratively in order to compute nearest neighbours for huge datasets. Currently I have a rather small matrix of 634,000 vectors with 6 columns.
As mentioned I'm using rdist to compute the distance of each vector to every single other vector, with each distance computation being a step. In addition at every step I run a function that computes k=1,2,3,4 nearest neighbours and takes the sum (effectively k=all neighbours).
###My function to compute k nearest neighbours from distance vector
knn <- function (vec,k) {
sum((sort(vec)[1:k+1]))
}
###My function to compute nearest neighbours iteratively for every vector
myfunc <- function (tab) {
rowsums <- numeric(nrow(tab)) ###Here I will save total sums
knnsums_log <- matrix(nrow=nrow(tab),ncol=4) ###Matrix for storing each of my kNN sums
for(i in 1:nrow(tab)) { ###For loop to compute distance and total sums
q<-as.matrix(rdist(tab[i,],tab))
rowsums[i] <- rowSums(q)
for (k in c(1:4)) { ###Nested loop to run my knn function
knnsums[i,k] <- knn(q,k)
}
}
return(cbind(rowsums,knnsums_log))
}
A sample of what the data looks like (634k rows of this)
X1 X2 X3 X4 X5 X6
1 0.00 0.02 0 0 0.02 -0.263309267
2 0.00 0.02 0 0 0.02 -0.171764667
3 0.00 0.02 0 0 0.02 -0.128784869
4 0.00 0.02 0 0 0.02 -0.905651733
For those unfamiliar with the function rdist gets Euclidean distance between arguements. It works far faster than a custom written function. It is more applicable than dist as dist only computes within matrix distances. I know technically that is what I'm doing but dist attempts to store that in memory and it is far far too large to even consider doing that.
How can I make the above work better? I've tried messing around with apply functions but get nothing useful. I hope I have explained everything clearly. If my math is correct, worst case guestimates it will take me over a week to run that code. I have very powerful servers to crunch this on. No GPUs however. I have not tried multicore (should have 12 available) but then again I don't know how I would delegate per core.
Thank you for your help.
Several tips:
0) profile your code using Rprof, with the line.profiling option
1) Matrices in R are column-wise. Because you compare your vectors between them, it will be much faster if you store them as columns of your matrix
2) I do not know where the rdist function comes from, but you should avoid the as.matrix(rdist(tab[i,],tab)) that will copy and create a new matrix
3) you can optimize your knn() function, that sorts 4 times the same vector
4) Why not just rdist(tab) ?