Tags: r, statistics, simulation, mle

How to make this R code (for loop) more efficient?


I am doing a simulation study and I wrote the following R code. Is there any way to write this code without using two for loops, or to make it more efficient (run faster)?

S = 10000
n = 100
v = c(5,10,50,100)
beta0.mle = matrix(NA,S,length(v))  # creating four S by length(v) matrices of NAs
beta1.mle = matrix(NA,S,length(v))
beta0.lse = matrix(NA,S,length(v))
beta1.lse = matrix(NA,S,length(v))
for (j in 1:length(v)){
  for (i in 1:S){
    set.seed(i)
    beta0 = 50
    beta1 = 10
    x = rnorm(n)
    e.t = rt(n,v[j])
    y.t = e.t + beta0 + beta1*x
    func1 = function(betas){
      beta0 = betas[1]
      beta1 = betas[2]
      sum = sum(log(1+1/v[j]*(y.t-beta0-beta1*x)^2))
      return((v[j]+1)/2*sum)
    }
    beta0.mle[i,j] = nlm(func1,c(1,1),iterlim = 1000)$estimate[1]
    beta1.mle[i,j] = nlm(func1,c(1,1),iterlim = 1000)$estimate[2]
    beta0.lse[i,j] = lm(y.t~x)$coef[1]
    beta1.lse[i,j] = lm(y.t~x)$coef[2]
  }
}

The function func1 inside the inner for loop is used by nlm (to find the MLE when the errors are t-distributed). I wanted to use the parallel package in R, but I didn't find any useful functions.


Solution

  • The key to getting anything to run faster in R is replacing for loops with vectorized functions (such as the apply family). Additionally, as in any programming language, you should look for places where you call an expensive function (such as nlm) more than once with the same arguments, and store the result rather than recomputing it each time.

    Here I am starting as you did by defining the parameters. Also, since beta0 and beta1 are always 50 and 10, I define those here as well.

    S <- 10000
    n <- 100
    v <- c(5,10,50,100)
    beta0 <- 50
    beta1 <- 10
    

    Next we define func1 outside the loop to avoid redefining it each time. func1 now takes three extra parameters, v, y.t, and x, so that it can be called with the new values.

    func1 <- function(betas, v, y.t, x){
      beta0 <- betas[1]
      beta1 <- betas[2]
      sum <- sum(log(1+1/v*(y.t-beta0-beta1*x)^2))
      return((v+1)/2*sum)
    }
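
    As a quick, purely illustrative sanity check (not part of the speed-up): func1 is the negative t log-likelihood up to additive constants, so evaluating it at the true coefficients should typically give a much smaller value than evaluating it at a distant guess such as c(1,1). The x.chk and y.chk names below are made up for this check only.

    set.seed(1)
    x.chk <- rnorm(n)
    y.chk <- beta0 + beta1*x.chk + rt(n, v[1])
    func1(c(50, 10), v = v[1], y.t = y.chk, x = x.chk)  # small: near the minimum
    func1(c(1, 1),   v = v[1], y.t = y.chk, x = x.chk)  # much larger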
    

    Now we actually do the real work. Rather than having nested loops, we use nested apply statements. The outer lapply makes a list with one element for each value of v, and the inner vapply makes a matrix of the four values you want (beta0.mle, beta1.mle, beta0.lse, beta1.lse) for each of the S replications.

    values <- lapply(v, function(j) vapply(1:S, function(s) {
      # This should look familiar, it is taken from your code
      set.seed(s)
      x <- rnorm(n)
      e.t <- rt(n,j)
      y.t <- e.t + beta0 + beta1*x
      # Rather than running `nlm` and `lm` twice each, run each once and store the results
      nlmmod <- nlm(func1,c(1,1), j, y.t, x, iterlim = 1000)
      lmmod <- lm(y.t~x)
      # now we return the four values of interest
      c(beta0.mle = nlmmod$estimate[1],
        beta1.mle = nlmmod$estimate[2],
        beta0.lse = lmmod$coef[1],
        beta1.lse = lmmod$coef[2])
    }, numeric(4)) # this tells `vapply` what to expect out of the function
    )
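
    Since the question mentions the parallel package: below is a minimal sketch of how the inner vapply could be swapped for parallel::mclapply. mclapply forks workers, so it only applies on Unix-alikes; on Windows you would build a cluster and use parLapply instead. Because set.seed(s) is still called inside each replication, the results should match the serial version.

    library(parallel)
    values <- lapply(v, function(j) {
      res <- mclapply(1:S, function(s) {
        set.seed(s)                     # same per-replication seed as before
        x <- rnorm(n)
        e.t <- rt(n, j)
        y.t <- e.t + beta0 + beta1*x
        nlmmod <- nlm(func1, c(1,1), j, y.t, x, iterlim = 1000)
        lmmod <- lm(y.t ~ x)
        c(beta0.mle = nlmmod$estimate[1],
          beta1.mle = nlmmod$estimate[2],
          beta0.lse = lmmod$coef[1],
          beta1.lse = lmmod$coef[2])
      }, mc.cores = max(1, detectCores() - 1))
      do.call(cbind, res)               # 4 x S matrix, same shape as the vapply result
    })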
    

    Finally we can reorganize everything into the four matrices.

    beta0.mle <- vapply(values, function(x) x["beta0.mle", ], numeric(S))
    beta1.mle <- vapply(values, function(x) x["beta1.mle", ], numeric(S))
    beta0.lse <- vapply(values, function(x) x["beta0.lse.(Intercept)", ], numeric(S))
    beta1.lse <- vapply(values, function(x) x["beta1.lse.x", ], numeric(S))
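
    Each of these is now an S by length(v) matrix, the same shape as the matrices filled by the original double loop, so any downstream summary code should work unchanged. A quick illustrative check:

    dim(beta0.mle)       # 10000 x 4, one column per value of v
    colMeans(beta0.mle)  # should all be close to the true beta0 = 50
    colMeans(beta1.mle)  # should all be close to the true beta1 = 10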
    

    As a final note, it may be possible to reorganize this to run even faster, depending on why you are using the S index to set the seed. If it is important to know which seed was used to generate each x with rnorm, then this may be the best I can do. However, if you are only setting the seed to ensure that all of your values of v are tested on the same values of x, then there is more reorganizing we can do with replicate that may produce a further speed-up, as sketched below.
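
    For example, here is a rough sketch of that reorganization, assuming the per-replication seed is not needed in itself: seed once, and let each replication draw a single x that is shared across every value of v, so rnorm runs S times instead of S*length(v) times.

    set.seed(1)                         # one global seed instead of one per replication
    values2 <- replicate(S, {
      x <- rnorm(n)                     # the same x is reused for every v
      vapply(v, function(j) {
        y.t <- beta0 + beta1*x + rt(n, j)
        nlmmod <- nlm(func1, c(1,1), j, y.t, x, iterlim = 1000)
        lmmod <- lm(y.t ~ x)
        c(nlmmod$estimate, lmmod$coef)  # beta0.mle, beta1.mle, beta0.lse, beta1.lse
      }, numeric(4))
    })
    # values2 is a 4 x length(v) x S array; e.g. values2[1, , ] gives beta0.mle
    # as a length(v) x S matrix (one row per value of v, one column per replication)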