Tags: r, optimization, for-loop, distance, neighbours

Optimizing (Vectorizing?) For loop with nested loop in R


I am using rdist iteratively to compute nearest neighbours for huge datasets. Currently I have a rather small matrix of 634,000 vectors with 6 columns.

As mentioned, I'm using rdist to compute the distance of each vector to every other vector, with each distance computation being a step. In addition, at every step I run a function that computes the k = 1, 2, 3, 4 nearest neighbours and takes their sum (effectively k = all neighbours).

### My function to compute k nearest neighbours from a distance vector

    knn <- function (vec, k) {
      # sort, skip the zero self-distance at position 1, sum the k nearest
      # (the original 1:k+1 parses as (1:k)+1, i.e. 2:(k+1))
      sum(sort(vec)[2:(k + 1)])
    }

### My function to compute nearest neighbours iteratively for every vector
    myfunc <- function (tab) {

      rowsums <- numeric(nrow(tab))                     ### Here I will save total sums
      knnsums_log <- matrix(nrow = nrow(tab), ncol = 4) ### Matrix for storing each of my kNN sums

      for (i in 1:nrow(tab)) {  ### Loop to compute distances and total sums
        q <- as.matrix(rdist(tab[i, ], tab))
        rowsums[i] <- rowSums(q)

        for (k in 1:4) {  ### Nested loop to run my knn function
          knnsums_log[i, k] <- knn(q, k)
        }
      }

      return(cbind(rowsums, knnsums_log))
    }

A sample of what the data looks like (634k rows of this):

          X1    X2    X3  X4  X5    X6
    1     0.00  0.02  0   0   0.02  -0.263309267
    2     0.00  0.02  0   0   0.02  -0.171764667
    3     0.00  0.02  0   0   0.02  -0.128784869
    4     0.00  0.02  0   0   0.02  -0.905651733

For those unfamiliar with the function: rdist computes the Euclidean distance between its arguments. It works far faster than a custom-written function, and it is more applicable here than dist, since dist only computes within-matrix distances. I know that is technically what I'm doing, but dist tries to hold the whole distance matrix in memory, and that is far, far too large to even consider.
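As a minimal sketch of that row-by-row pattern, assuming rdist comes from the fields package (which matches the signature used above), on a toy stand-in matrix:

    # Sketch, assuming rdist() from the 'fields' package.
    library(fields)

    set.seed(1)
    tab <- matrix(rnorm(5 * 6), nrow = 5, ncol = 6)  # toy stand-in for the 634k x 6 data

    # Cross-distances from row 1 to every row, as a 1 x 5 matrix --
    # only one row of the full distance matrix is ever held in memory.
    q <- rdist(tab[1, , drop = FALSE], tab)
    q[1, 1]  # distance of the row to itself, i.e. 0

This is why the loop stays within memory limits: each iteration materialises a single 1 x n slice instead of the full n x n matrix that dist would build.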

How can I make the above work better? I've tried messing around with the apply functions but got nothing useful. I hope I have explained everything clearly. If my math is correct, a worst-case guesstimate is that it will take me over a week to run that code. I have very powerful servers to crunch this on, but no GPUs. I have not tried multicore (I should have 12 cores available), but I don't know how I would delegate the work per core.

Thank you for your help.


Solution

  • Several tips:

    0) Profile your code using Rprof, with the line.profiling option.

    1) Matrices in R are stored column-major. Because you compare your vectors with one another, it will be much faster if you store them as columns of your matrix.

    2) I do not know where the rdist function comes from, but you should avoid the as.matrix(rdist(tab[i,],tab)), which copies the result into a new matrix.

    3) You can optimize your knn() function, which currently sorts the same vector 4 times.

    4) Why not just rdist(tab)?
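Tip 0 can be sketched as follows; myfunc and tab stand for the asker's own function and data, and line-level results only appear if the code was loaded with source references kept:

    # Profile myfunc() at line granularity.
    # For line numbers to show up, load the code with srcrefs, e.g.
    # options(keep.source = TRUE) before source()-ing it.
    Rprof("profile.out", line.profiling = TRUE)
    res <- myfunc(tab[1:1000, ])   # profile a small slice first, not all 634k rows
    Rprof(NULL)                    # stop profiling
    summaryRprof("profile.out", lines = "show")

Profiling a small slice first tells you whether the time goes into rdist itself or into the repeated sorting, before committing a week of compute.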
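For tip 3: the four kNN sums are nested, so one sort plus a cumulative sum yields all of k = 1..4 at once. A sketch (knn_sums is a hypothetical replacement name):

    knn_sums <- function(vec, kmax = 4) {
      s <- sort(vec)            # one sort instead of kmax sorts
      cumsum(s[2:(kmax + 1)])   # drop s[1], the zero self-distance
    }

    # knn_sums(q, 4) matches sapply(1:4, function(k) knn(q, k))
    # element-for-element, with a single sort.

Inside the loop this replaces the inner for (k in 1:4) entirely: knnsums_log[i, ] <- knn_sums(q, 4).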