Search code examples
rfor-loopstatisticssimulationmontecarlo

Do I need to reduce for-loop in R and how?


I am a freshman in R coding, but I've heard that loop in R is much slower than other language like Python or C. So do I need to reduce loop when coding in R?

Specifically, in this simulation code, how can I improve my poor coding skill?

library(moments)
n <- c(5:20)
m <- c(1:10000)
skew <- c()
kurt <- c()
for(num in n){
  beta1 <- c()
  beta2 <- c()
  for(i in m){
    set.seed(num * 10000 + i)
    x <- rnorm(num, mean = 0, sd = 1)
    beta1 <- c(beta1, skewness(x))
    beta2 <- c(beta2, kurtosis(x) - 3)
  }
  skew <- c(skew, quantile(beta1, probs = c(0, 0.01, 0.1, 0.2, 0.5, 0.8, 0.9, 0.99, 1)))
  kurt <- c(kurt, quantile(beta2, probs = c(0, 0.01, 0.1, 0.2, 0.5, 0.8, 0.9, 0.99, 1)))
}

Solution

  • One main advantage of not using for loops in R is to exploit its vectorization. So while in languages like Python or C you code vector calculations for each element of a vector, in R you can conveniently code the calculation for the entire vector at once (see Edit below) and also reduce computation time by actually using fast underlying C, Fortran, etc. functions.

    I would put all the calculations you want to do for a single sample size into a function statFUN and put it into an lapply to loop over the vector of sample sizes n.

    For the quantiles we either could use apply or matrixStats::rowQuantiles which I recommend because it's faster.

    set.seed() should be needed just one time before running the lapply, all the results will be reproducible with that one seed.

    n <- 5:20  ## different sample sizes
    m <- 1e4   ## number of replications in each iteration
    probs <- c(0, 0.01, 0.1, 0.2, 0.5, 0.8, 0.9, 0.99, 1)
    
    library(moments)
    library(matrixStats)
    
    statFUN <- function(i, num) {
      r <- replicate(i, {
        x <- rnorm(num, mean=0, sd=1)
        c(kurt=kurtosis(x) - 3, skew=skewness(x))
      })
      # t(apply(r, 1, quantile, probs=probs))  ## using base R
      rowQuantiles(r, probs=probs)  ## using matrixStats
    }
    
    set.seed(42)
    res <- lapply(n, statFUN, m)
    

    Result

    The result is a list of the quantiles of kurtosis and skewness quantiles for each sample size.

    res
    # [[1]]
    #               0%          1%         10%         20%
    # kurt -0.04710729 -0.04658709 -0.04190536 -0.03670343
    # skew -0.03045563 -0.02969417 -0.02284104 -0.01522645
    #              50%          80%           90%         99%
    # kurt -0.03388803 -0.006250622  1.068998e-03 0.007656657
    # skew -0.01028591 -0.006132523 -5.883157e-05 0.005407491
    #             100%
    # kurt 0.008388619
    # skew 0.006014860
    # 
    # [[2]]
    #               0%          1%         10%         20%
    # kurt -0.09089922 -0.08859363 -0.06784329 -0.04478737
    # skew -0.03252828 -0.03165837 -0.02382918 -0.01513009
    #               50%          80%        90%        99%
    # kurt -0.023634727 -0.005277533 0.01038904 0.02448896
    # skew  0.003433589  0.017711708 0.01947178 0.02105585
    #            100%
    # kurt 0.02605562
    # skew 0.02123186
    #
    # [...]
    

    where

    length(res)
    # [1] 16
    

    Edit

    Here a small example to better illustrate what is actually meant by vectorization in R. While in most programming languages the addition of two vectors is coded element-wise, in R the addition of vectors may be coded directly (i.e. in a vectorized manner).

    a <- 1:9
    b <- rev(a)
    
    ## element wise addition of vectors a and b
    s1 <- c()
    for (i in seq(a)) {
      s1[i] <- a[i] + b[i]
    }
    s1
    # [1] 10 10 10 10 10 10 10 10 10
    
    ## direct addition of vectors a and b (i.e. vectorized)
    s2 <- a + b
    s2
    # [1] 10 10 10 10 10 10 10 10 10
    

    Instead of for loops we may look into the *apply family. However, mostly there are still for-loops hidden in it. (To see function codes type e.g. lapply without brackets or anything.)

    You might want to read e.g. those great Q&As:

    Note: The vectorization is actually just the language feature of R. So-called "vectorized functions" often use C, Fortran, etc. code internally, in which you still find for-loops at the end, but in a much faster language, though. See for instance the source code of summary.c which is called when we use sum().