Search code examples
rrandomsimulationstatamontecarlo

Equivalent of Stata command `simulate` in R for Montecarlo Simulation


I am searching for an equivalent function in R of the extremely convenient Stata command simulate. The command basically allows you to declare a program (reg_simulation in the example below) and then invoke such a program from simulate and store desired outputs.

Below is a Stata illustration of the usage of the simulate program, together with my attempt to replicate it using R.

Finally, my main question is: is this how R users will run a Montecarlo simulation? or am I missing something in terms of structure or speed bottlenecks? Thank you a lot in advance.

Stata example

  1. Defining reg_simulation program.
clear all
*Define "reg_simulation" to be used later on by "simulate" command 
program reg_simulation, rclass
    *Declaring Stata version
    version 13
    *Droping all variables on memory
    drop _all
    *Set sample size (n=100)
    set obs 100
    *Simulate model
    gen x1 = rnormal()
    gen x2 = rnormal()
    gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
    *Estimate OLS
    reg y x1 x2 
    *Store coefficients
    matrix B = e(b)
    return matrix betas = B 
end
  1. Calling reg_simulation from simulate command:
*Seet seed
set seed 1234
*Run the actual simulation 10 times using "reg_simulation"
simulate , reps(10) nodots: reg_simulation
  1. Obtained result (stored data on memory)
_b_x1   _b_x2   _b_cons
.4470155    1.50748     1.043514
.4235979    1.60144     1.048863
.5006762    1.362679    .8828927
.5319981    1.494726    1.103693
.4926634    1.476443    .8611253
.5920001    1.557737    .8391003
.5893909    1.384571    1.312495
.4721891    1.37305     1.017576
.7109139    1.47294     1.055216
.4197589    1.442816    .9404677

R replication of the Stata program above.

Using R I have managed to get the following (not an R expert tho). However, the part that worries me the most is the for-loop structure that loops over each the number of repetitions nreps.

  1. Defining reg_simulation function.
#Defining a function 
reg_simulation<- function(obs = 1000){
    data <- data.frame(
    #Generate data
    x1 <-rnorm(obs, 0 , 1) ,
    x2 <-rnorm(obs, 0 , 1) ,
    y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1) )
  #Estimate OLS
  ols <- lm(y ~ x1 + x2, data=data)  
  return(ols$coefficients)  
}
  1. Calling reg_simulation 10 times using a for-loop structure:
#Generate list to store results from simulation
results_list <- list()
# N repetitions
nreps <- 10
for (i in 1:nreps) {
  #Set seed internally (to get different values in each run)
  set.seed(i)
  #Save results into list
  results_list[i]  <- list(reg_simulation(obs=1000))  
}
#unlist results
df_results<- data.frame(t(sapply(results_list, 
                       function(x) x[1:max(lengths(results_list))])))
  1. Obtained result: df_results.
#final results
df_results
#   X.Intercept.  x1        x2
# 1     1.0162384 0.5490488 1.522017
# 2     1.0663263 0.4989537 1.496758
# 3     0.9862365 0.5144083 1.462388
# 4     1.0137042 0.4767466 1.551139
# 5     0.9996164 0.5020535 1.489724
# 6     1.0351182 0.4372447 1.444495
# 7     0.9975050 0.4809259 1.525741
# 8     1.0286192 0.5253288 1.491966
# 9     1.0107962 0.4659812 1.505793
# 10    0.9765663 0.5317318 1.501162

Solution

  • You're on the right track. Couple of hints/corrections:

    1. Don't use <- inside data.frame()

    In R, we construct data frames using = for internal column assignment, i.e. data.frame(x = 1:10, y = 11:20) rather than data.frame(x <- 1:10, y <- 11:20).

    (There's more to be said about <- vs =, but I don't want to distract from your main question.)

    In your case, you don't actually even need to create a data frame since x1, x2 and y will all be recognized as "global" variables within the scope of the function. I'll post some code at the end of my answer demonstrating this.

    1. When growing a list via a for loop in R, always try to pre-allocate the list first

    Always try to pre-allocate the list length and type if you are going to grow a (long) for loop. Reason: That way, R knows how much memory to efficiently allocate to your object. In the case where you are only doing 10 reps, that would mean starting with something like:

    results_list <- vector("list", 10)
    

    3. Consider using lapply instead of for

    for loops have a bit of bad rep in R. (Somewhat unfairly, but that's a story for another day.) An alternative that many R users would consider is the functional programming approach offered by lapply. I'll hold off on showing you the code for a second, but it will look very similar to a for loop. Just to note quickly, following on from point 2, that one immediate benefit is that you don't need to pre-allocate the list with lapply.

    4. Run large loops in parallel

    A Monte Carlo simulation is an ideal candidate for running everything in parallel, since each iteration is supposed to be independent of the others. An easy way to go parallel in R is via the future.apply package.


    Putting everything together, here's how I'd probably do your simulation. Note that this might be more "advanced" than you possibly need, but since I'm here...

    library(data.table)   ## optional, but what I'll use to coerce the list into a DT
    library(future.apply) ## for parallel stuff
    plan(multisession)    ## use all available cores
    
    obs <- 1e3
    
    # Defining a function 
    reg_simulation <- function(...){
        x1 <- rnorm(obs, 0 , 1)
        x2 <- rnorm(obs, 0 , 1)
        y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1)
        #Estimate OLS
        ols <- lm(y ~ x1 + x2)  
        
        # return(ols$coefficients)
        return(as.data.frame(t(ols$coefficients)))
    }
    
    # N repetitions
    nreps <- 10
    
    ## Serial version
    # results  <- lapply(1:nreps, reg_simulation)
    
    ## Parallel version
    results  <- future_lapply(1:nreps, reg_simulation, future.seed = 1234L)
    
    ## Unlist / convert into a data.table
    results <- rbindlist(results)