Search code examples
rperformancenearest-neighbormultivariate-time-series

Fast way to find k nearest neighbors from only past, in R?


Assume I have a numeric data with 10,000 rows and 8 columns. I want to obtain the first k neighbors for each row (skipping first 1,000 rows) using euclidean distance but the catch is for each row I am only interested in the previous rows. (e.g. for the 2001th row, I only search first 2000 rows).

Changing the reference for each row is too slow. The fastest function I could write was using RANN to get 2k (or 5k) closest neighbors then filtering out the future observations.

A slow example: (200 rows, 5 columns, 3 nearest neighbors)

data = matrix(rnorm(1000), nrow = 200, ncol = 5)
result <- list()
for (i in c(101:200)) {
    distances <- apply(data[1:(i-1),], 1, function(x) {
        dist(rbind(x, data[i, ]))
    })
    neighbors <- sort(distances, index.return = TRUE)$ix[1:3]
    result[[i - 100]] <- neighbors
}

The fastest approach was using RANN::nn2(data = data[1:200,], query = data[101:200,], k = 2*k), then filtering the future ones (and hope to have at least k values).

The filtering part and unneccessary computation of nearest neighbors increase the time complexity significantly.

I would be glad to hear any suggestions.


Solution

  • A solution using RANN::nn2. The idea is to break the data into chunks and process each separately before combining the results. It processes a 10k-by-8 matrix in a fraction of a second.

    library(RANN) # for the nn2 function
    library(Rfast) # for the rowOrder function
    library(data.table) # for the rbindlist function
    
    nr <- 1e4L
    start <- 1001L
    nc <- 8L
    k <- 3L
    data <- matrix(rnorm(nr*nc), nr, nc)
    
    system.time({
      n <- 450L # chunk size
      n1 <- n - 1L
      
      nn1 <- rbindlist(
        lapply(
          # split the data into chunks
          as.data.frame(matrix(start:nr, n)),
          function(i) {
            # initialize a matrix with -Inf
            d <- matrix(Inf, n1, n1)
            # fill the lower diagonal of m with the negative of the distance matrix
            d[sequence(n1:1, seq(1, n1^2, n))] <- dist(data[i,])
            # get the nearest neighbor from previous chunks
            nn <- nn2(data[1:(i[1] - 1L),], data[i,], k)
            # bind the two distance matrices together
            out <- rowOrder(cbind(nn$nn.dists, rbind(Inf, d)))[,1:k]
            # which neighbors are from a previous chunk?
            iPrev <- which(out <= k)
            # indices of nearest neighbors from a previous chunk
            out[iPrev] <- nn$nn.idx[cbind(((iPrev - 1L) %% n) + 1L, out[iPrev])]
            # indices of nearest neighbors from current chunk
            out[-iPrev] <- out[-iPrev] - k + i[1] - 1L
            # convert to a data.table in order to use rbindlist
            as.data.table(out)
          }
        ), FALSE
      )
    })
    #>    user  system elapsed 
    #>    0.46    0.00    0.47
    

    Compare the result to a filtering approach. Note that the code below is not guaranteed to find the nearest previous neighbor, but it could be modified to iteratively increase k for rows that fail to do so.

    system.time({
      nn2 <- with(
        # get the 100 nearest neighbors
        nn2(data, data[start:nr,], 100),
        # find the nearest from a previous row
        matrix(nn.idx[cbind(1:nrow(nn.idx), c(rowOrder(1 - (nn.idx < start:nr), TRUE)[,1:k]))], ncol = k)
      )
    })
    #>    user  system elapsed 
    #>    1.31    0.00    1.31
    
    identical(unlist(nn1, 0, 0), c(nn2))
    #> [1] TRUE