
More efficient way to compute the rowNorms in R?


I wrote a program using an unsupervised K-means algorithm to try to compress images. It works now, but compared with Python it's incredibly slow! Specifically, it's computing the rowNorms that's slow. The array X has 350,000+ elements.

This is the particular function:

find_closest_centroids <- function(X, centroids) {
  m <- nrow(X)
  c <- integer(m)

  for (i in 1:m) {
    # distance from point i to every centroid;
    # rowNorms() is not base R (it comes from an add-on package such as wordspace)
    distances <- rowNorms(sweep(centroids, 2, X[i, ]))

    c[i] <- which.min(distances)
  }
  return(c)
}

In Python I am able to do it like this:

import numpy as np

def find_closest_centroids(X, centroids):
    m = len(X)
    c = np.zeros(m, dtype=int)

    for i in range(m):
        # distance from point i to every centroid
        distances = np.linalg.norm(X[i] - centroids, axis=1)

        c[i] = np.argmin(distances)

    return c

Any recommendations?

Thanks.


Solution

  • As dvd280 has noted in his comment, R tends to do worse than many other languages in terms of raw performance. If you are content with the performance of your Python code but need the function available in R, you might want to look into the reticulate package, which provides an interface to Python much like the Rcpp package dvd280 mentioned does for C++ (there is a short sketch of that route at the end of this answer).

    If you still want to implement this natively in R, be mindful of the data structures you use. For row-wise operations, data frames are a poor choice because they are stored as lists of columns. I'm not sure which data structures your code uses, but rowNorms() appears to be a matrix method, so you may get more mileage out of plain matrices or a list-of-rows structure (see the base-R sketch below).

    If you feel like getting into dplyr, you might find this vignette on row-wise operations helpful (a minimal rowwise() example is included below). Make sure you have the latest version of the package, as the vignette is based on dplyr 1.0.

    The data.table package tends to yield the best performance for large data sets in R, but I'm not familiar with it, so I can't give you any further directions on that.
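
    If you go the reticulate route, a rough sketch could look like the following. It assumes the Python function above is saved in a file called centroids.py (the file name is just for illustration) and that numpy is installed in the Python environment reticulate points to:

    library(reticulate)

    # exposes find_closest_centroids() from centroids.py as an R function
    source_python("centroids.py")

    X <- matrix(runif(20), nrow = 10)        # toy data: 10 points, 2 features
    centroids <- matrix(runif(6), nrow = 3)  # 3 centroids, 2 features

    # reticulate converts the R matrices to numpy arrays and back;
    # argmin() returns 0-based indices, so add 1 to get R-style indices
    idx <- find_closest_centroids(X, centroids) + 1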
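
    A native base-R version of your loop that keeps everything in plain matrices and replaces the rowNorms() call with rowSums() might look like this (the function name is my own; squared distances are used because which.min() is unaffected by dropping the square root):

    find_closest_centroids_mat <- function(X, centroids) {
      X <- as.matrix(X)                  # matrices, not data frames, for row-wise work
      centroids <- as.matrix(centroids)
      m <- nrow(X)
      out <- integer(m)
      for (i in seq_len(m)) {
        diffs <- sweep(centroids, 2, X[i, ])    # subtract point i from every centroid
        out[i] <- which.min(rowSums(diffs^2))   # index of the nearest centroid
      }
      out
    }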
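
    And here is a minimal illustration of the rowwise() pattern from that vignette, with made-up column names x1 and x2; rowwise() aims at clarity rather than raw speed, so don't expect it to beat the matrix version on large data:

    library(dplyr)

    points <- tibble(x1 = runif(5), x2 = runif(5))   # toy points as a tibble
    centroids <- matrix(runif(6), nrow = 3)          # 3 centroids, 2 features

    points %>%
      rowwise() %>%
      mutate(closest = which.min(rowSums(sweep(centroids, 2, c(x1, x2))^2))) %>%
      ungroup()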