Search code examples
rmontecarlopercentilebernoulli-probability

Need help structuring a Monte Carlo simulation and finding percentiles of the result with R


I have a CSV file containing a set of events (ca 40 items), all of which can either happen or not, depending on given probability. Columns: event name, yield size, probability.

What interests me of this data is the total yield size of the set (sum of all yields of the set), and possibly also total sum of yield per event. So, as event can either happen on not and therefore total yield size of the set can differ, I need to make a Monte Carlo simulation over the set, having a Bernoulli trial over Probability column.

And lastly I need to calculate percentiles over sum of Yield of the whole set or a specific event over all Monte Carlo simulation iterations (scenarios).

I'm having trouble writing it down..(I'm still learning R, i'm more used to Java/C# etc)

The code I made currently:

#Generate sample data for a set of events that I want to simulate
eventcol <- c('Event1', 'Event2', 'Event3', 'Event4', 'Event5')
yieldcol <- c(350, 200, 100, 120, 540)
problcol <- c(0.5, 0.2, 0.9, 0.4, 0.7)
events <- data.frame(Name=eventcol, Yield=yieldcol, Probability=problcol)

#Forecast function
forecast <- function(events){
  count <- nrow(events)
  data <- data.frame(Id=seq(1, count))
  data$Name <- events$Name
  data$Yield <- events$Yield
  data$Exists <- rbinom(count,1,events$Probability)
  return(data)
}

#Create Monte Carlo simulation scenarios/realizations
scenarios <- replicate(4, forecast(events))
scenarios

The output is the following:

> scenarios
       [,1]      [,2]      [,3]      [,4]     
Id     Integer,5 Integer,5 Integer,5 Integer,5
Name   factor,5  factor,5  factor,5  factor,5 
Yield  Numeric,5 Numeric,5 Numeric,5 Numeric,5
Exists Numeric,5 Numeric,5 Numeric,5 Numeric,5

But I have no clue how to sum Yield over events that do exist (Exists == 1) per scenario, let alone then find a percentile (with quantile function) over the sums. How would you proceed with that?

Concerning data structure, I have a few ideas, but I'm not sure..

  1. Maybe I should transpose the forecast and then somehow iterate over MC scenarios one-by-one and sum the data?

  2. Maybe I should filter out events from the results that do not exist (Exists == 0). But how and where should I do that?

It would probably also make more sense if the results would look like this (but also I have no ideas how to achieve this):

Scenario     Name     Yield
1            Event1   350
1            Event2   200
2            Event1   350
...

Please share your thoughts!

Thankyou!


Solution

  • Yes, question is much clearer now!

    The scenarios output is a collection of lists. scenarios[3,] contains the 'potential yields', scenarios[4,] contains the 'exists'.

    The actual yield for each scenario can be determined as follows:

    potential_yields = simplify2array(scenarios[3,])
    exists           = simplify2array(scenarios[4,])
    actual_yields    = colSums(yields*exists)
    

    Determine and plot quantiles:

    yield_q  = quantile(actual_yields,probs=0:100/100)
    plot(x = 0:100, y = yield_q)
    

    Perhaps that is what you're after.