Search code examples
rmatrixmultidimensional-arraystatisticssimulation

Creating a function to characterise repeated simulations


I want to create a function which helps characterise the results to some simulations. For the purposes of this post let the simulation function be:

example_sim <- function(time=100, npops=5){
  result <- data.frame(matrix(NA, nrow = time, ncol = npops))
  colnames(result) <- LETTERS[1:npops]
  for(i in 1:npops){
    sim <- sample.int(time, time)
    result[,i] <- sim
    result[,i] <- result[,i]*i
  }
  return(result)
}

This creates a data frame with varying length and width based on the number of populations (npops) and the time simulated.

I want to create a function which uses the output of such simulations and characterises the mean, variance for each population over an n amount of simulations (nsims).

So far I have managed to get it working for two populations with the following code:

library("matrixStats")
library("reshape2")

ensembles <- function(nsims=10, time = 100, npops = 2){
  result_N.A <- data.frame(matrix(NA, nrow = time, ncol = nsims))
  result_N.B <- data.frame(matrix(NA, nrow = time, ncol = nsims))
  for( i in 1:(nsims)){
    simulation_with_2pops <- example_sim(time=100,npops=2)
    result_N.A[,i] <- simulation_with_2pops[,1]
    result_N.B[,i] <- simulation_with_2pops[,2]
  }
  output <- simulation_with_2pops
  for( j in 1:params$ntime){
    output$meanA[j] <- rowMeans(result_N.A[j,])
  }
  for( j in 1:params$ntime){
    output$meanB[j] <- rowMeans(result_N.B[j,])
  }
  for( j in 1:params$ntime){
    output$varA[j] <- rowVars(as.matrix(result_N.A[j,]))
  }
  for( j in 1:params$ntime){
    output$varB[j] <- rowVars(as.matrix(result_N.B[j,]))
  }
  return(output)
} 
ensembles_output<- ensembles(nsims = 10)
ensembles_output

To fully implement the function for any number of populations I would need to create another for loop where I create and update the result_N.A object. (Presumably called something like result[i].) I have also thought about creating a 3 dimensional object (time, npops, nsims) and taking a slice of it to calculate the mean and variance but i havent had much success yet. I am not married for this route and am very open to other recommendations.

Eventually I would like to create a code where the covariance and correlation are also calculated by giving highlighting two populations in the parameters. (for instance population A and population E). If you have any ideas on the implementation i would be very grateful to hear them.

Thank you for considering this problem.


Solution

  • I think using a multidimensional array is a very good idea in this case.

    First, you can get the simulations of example_sim() much cheaper using mapply(). Here an example with time=10 and npops=3. Use the same set.seed(42) and parameters and check for yourself.

    I use much smaller parameters here so that you can easily check the result in your head.

    set.seed(42)
    sim <- replicate(nsims, mapply(\(time, i) sample.int(time, time)*i, 10, 1:3))
    
    sim
    # , , 1
    # 
    #       [,1] [,2] [,3]
    #  [1,]    1   16   27
    #  [2,]    5   14   30
    #  [3,]   10    8    9
    #  [4,]    8    2   12
    #  [5,]    2   10   15
    #  [6,]    4   20   18
    #  [7,]    6    4    3
    #  [8,]    9   12    6
    #  [9,]    7   18   24
    # [10,]    3    6   21
    # 
    # , , 2
    # 
    #       [,1] [,2] [,3]
    #  [1,]    3   10   18
    #  [2,]    1    8    6
    #  [3,]    2    4   12
    #  [4,]    6   16    9
    #  [5,]   10    6   30
    #  [6,]    8    2   15
    #  [7,]    4   20   27
    #  [8,]    5   14   21
    #  [9,]    7   12   24
    # [10,]    9   18    3
    # 
    # , , 3
    # 
    #       [,1] [,2] [,3]
    #  [1,]   10    8   18
    #  [2,]    8   18    6
    #  [3,]    5    6   27
    #  [4,]    1   16    3
    #  [5,]    7   10   24
    #  [6,]    4   12   15
    #  [7,]    6   20   30
    #  [8,]    2    4    9
    #  [9,]    9    2   12
    # [10,]    3   14   21
    

    Next, I believe you want to gather row-wise statistics across each population column A, B, C, ... . Here you basically want apply(., MARGINS=1:2, FUN). Just for the mean there exists rowMeans(., dims=2L), which is faster.

    rowMeans(sim, dims=2L)
    #           [,1]      [,2] [,3]
    #  [1,] 4.666667 11.333333   21
    #  [2,] 4.666667 13.333333   14
    #  [3,] 5.666667  6.000000   16
    #  [4,] 5.000000 11.333333    8
    #  [5,] 6.333333  8.666667   23
    #  [6,] 5.333333 11.333333   16
    #  [7,] 5.333333 14.666667   20
    #  [8,] 5.333333 10.000000   12
    #  [9,] 7.666667 10.666667   20
    # [10,] 5.000000 12.666667   15
    
    apply(sim, 1:2, var)
    #            [,1]      [,2] [,3]
    #  [1,] 22.333333 17.333333   27
    #  [2,] 12.333333 25.333333  192
    #  [3,] 16.333333  4.000000   93
    #  [4,] 13.000000 65.333333   21
    #  [5,] 16.333333  5.333333   57
    #  [6,]  5.333333 81.333333    3
    #  [7,]  1.333333 85.333333  219
    #  [8,] 12.333333 28.000000   63
    #  [9,]  1.333333 65.333333   48
    # [10,] 12.000000 37.333333  108
    

    I'm not sure however, why you use simulation_with_2pops for your final output, since it's the result of last iteration of for (i in 1:nsims) loop. Anyway, hope this helps you further.

    Note: R >= 4.1 used.