Tags: r, loops, nested-loops, data-science, data-analysis

What is a better way to write this nested for loop in R?


I am writing a for loop to calculate a numerator that is part of a larger formula. The nested loop works, but it takes a long time to compute. What would be a better way to do this?

city is a data frame with the following columns: pop, not.white, pct.not.white.

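  n <- nrow(city)
  numerator <- 0

  for (i in 1:n) {
    ti <- city$pop[i]
    pi <- city$pct.not.white[i]  # note: masks R's built-in constant pi

    for (j in 1:n) {
      tj <- city$pop[j]
      pj <- city$pct.not.white[j]
      # accumulate ti * tj * |pi - pj| over all ordered pairs (i, j)
      numerator <- numerator + (ti * tj) * abs(pi - pj)
    }
  }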

Solution

  • Use the following toy data for result validation.

    set.seed(0)
    city <- data.frame(pop = runif(101), pct.not.white = runif(101))
    

    The most obvious "vectorization":

    # n <- nrow(city)
    titj <- tcrossprod(city$pop)
    pipj <- outer(city$pct.not.white, city$pct.not.white, "-")
    numerator <- sum(titj * abs(pipj))
    

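    This will probably run into memory problems if n > 5000, because it holds several n x n matrices of doubles at once (about 200 MB each at n = 5000).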
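
    As a sanity check on small inputs, the vectorized result can be compared against the original double loop. The following is a minimal sketch using the toy data above; the wrapper names numerator_loop and numerator_outer are hypothetical and not part of the original code.

    ```r
    set.seed(0)
    city <- data.frame(pop = runif(101), pct.not.white = runif(101))

    # Reference: the double loop from the question, wrapped in a function.
    numerator_loop <- function(pop, p) {
      total <- 0
      for (i in seq_along(pop)) {
        for (j in seq_along(pop)) {
          total <- total + pop[i] * pop[j] * abs(p[i] - p[j])
        }
      }
      total
    }

    # Vectorized version: builds full n x n matrices.
    numerator_outer <- function(pop, p) {
      sum(tcrossprod(pop) * abs(outer(p, p, "-")))
    }

    ref <- numerator_loop(city$pop, city$pct.not.white)
    vec <- numerator_outer(city$pop, city$pct.not.white)

    # Compare with a floating-point tolerance rather than exact equality,
    # since the two versions add the terms in a different order.
    isTRUE(all.equal(ref, vec))
    ```

    Comparing with all.equal rather than == avoids false alarms from rounding differences between the two summation orders.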


    A clever workaround (exploiting symmetry; more memory efficient "vectorization"):

    ## see https://stackoverflow.com/a/52086291/4891738 for function: tri_ind
    n <- nrow(city)
    ij <- tri_ind(n, lower = TRUE, diag = FALSE)
    titj <- city$pop[ij$i] * city$pop[ij$j]
    pipj <- abs(city$pct.not.white[ij$i] - city$pct.not.white[ij$j])
    numerator <- 2 * crossprod(titj, pipj)[1]
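
    The tri_ind function lives in the linked answer and is not reproduced here. If it is not at hand, a rough stand-in can be sketched as below. The helper lower_tri_pairs is a hypothetical name and an assumption about what tri_ind(n, lower = TRUE, diag = FALSE) returns (row and column indices of the strict lower triangle); it is not the linked implementation.

    ```r
    set.seed(0)
    city <- data.frame(pop = runif(101), pct.not.white = runif(101))

    # Hypothetical stand-in for tri_ind(n, lower = TRUE, diag = FALSE):
    # returns row (i) and column (j) indices of the strict lower triangle,
    # in column-major order, without allocating an n x n matrix.
    lower_tri_pairs <- function(n) {
      if (n < 2L) return(list(i = integer(0), j = integer(0)))
      counts <- rev(seq_len(n - 1L))         # column j holds n - j entries
      j <- rep.int(seq_len(n - 1L), counts)
      i <- sequence(counts) + j              # rows j + 1, ..., n within column j
      list(i = i, j = j)
    }

    n <- nrow(city)
    ij <- lower_tri_pairs(n)
    titj <- city$pop[ij$i] * city$pop[ij$j]
    pipj <- abs(city$pct.not.white[ij$i] - city$pct.not.white[ij$j])

    # Factor 2: each unordered pair appears twice in the full double sum,
    # and the diagonal terms (i == j) are zero.
    numerator <- 2 * sum(titj * pipj)
    ```

    This still builds vectors of length n * (n - 1) / 2, so it roughly halves the memory of the full-matrix version rather than removing the quadratic growth.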
    

    The ultimate solution is to write the loop in C / C++, which I will not showcase.
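
    Short of compiled code, it may be worth noting that this particular sum can also be evaluated without forming any pairs. This is a different technique from the ones above: after sorting by pct.not.white the absolute value can be dropped, and running totals supply the inner sum. The sketch below assumes there are no missing values; numerator_sorted is a hypothetical name.

    ```r
    set.seed(0)
    city <- data.frame(pop = runif(101), pct.not.white = runif(101))

    # Sort-based evaluation of sum_i sum_j t_i * t_j * |p_i - p_j|.
    # With p sorted ascending, |p_i - p_j| = p_i - p_j for j < i, so
    #   total = 2 * sum_i t_i * (p_i * sum_{j<i} t_j - sum_{j<i} t_j * p_j)
    numerator_sorted <- function(pop, p) {
      o <- order(p)
      w <- pop[o]
      p <- p[o]
      w_before  <- cumsum(w) - w          # sum of t_j over j < i
      wp_before <- cumsum(w * p) - w * p  # sum of t_j * p_j over j < i
      2 * sum(w * (p * w_before - wp_before))
    }

    fast <- numerator_sorted(city$pop, city$pct.not.white)

    # Cross-check against the full-matrix version on the toy data.
    full <- sum(tcrossprod(city$pop) *
                abs(outer(city$pct.not.white, city$pct.not.white, "-")))
    isTRUE(all.equal(fast, full))
    ```

    The cost here is dominated by the sort, so time grows as n log n and memory as n, at the price of a little extra rounding error from the subtractions.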