Search code examples
excelmontecarlo

Why can't you do a Monte Carlo of a Monte Carlo in Excel


I teach a class in Excel and I want to take multiple averages from different sets of Monte Carlo simulations as a way to illustrate the central limit theorem. In Excel you can do a Monte Carlo using a data table and then take the average. I want to have a second data table that takes the average from multiple sets of simulations. However, the second data table will just return the same value in each cell. I'm including a screen shot with a very simple example to illustrate the issue.

enter image description here

Can anyone help me understand why this isn't working? Is there a way to get multiple averages using a Data Table? I know I could do this pretty easily in VBA, but I am curious about a solution that does not use VBA since most of the students do not know VBA.

Many thanks!

---- UPDATE ----

I wanted to build off of the fantastic answer given by @NeilT (if you haven't read that yet, do so first). I originally thought the issue with the second data table just had to do with the volatile function RAND() not updating the first data table when the second data table is calculated, but it is actually more basic than that. Any input into the first data table will not update as the second data table is calculated as this new example shows.

enter image description here

Even though the first data table formula now includes a reference to the cell that the second data table will be updating, the second data table still has a constant output. This helped me better understand what Neil meant by "But data tables don't work quite like this. Their volatility (or otherwise) depends only on the formula in the evaluation formula cell (not its value)." Hopefully it will be helpful to someone else as well.


Solution

  • There are two reasons this is not working as you expect.

    1. Reason #1 is that Excel's calculation tree is evaluated only once per calculation, so your second data table is only receiving one set of inputs of which to take the average, meaning that each of its five evaluations performs the same AVERAGE() calculation with the same result.
    2. Reason #2 is that when Excel recalculates this sheet, the outputs from the second data table are calculated independently of cells D5:D9 and cell H4 (so from Excel's perspective they do not depend directly on the outputs from the first data table, even though it looks like they should).

    Single calculation paradigm

    Suppose every formula only refers to cells above it and/or to its left. Then you can (very loosely) think of Excel's "calculation tree" as starting in the top left (i.e. A1) and working down and right on a cell-by-cell basis until every cell has been calculated.

    In your case, the order of calculations might be roughly:

    1. Cell B2 (first random number)
    2. Cell B3 (second random number)
    3. Cell D4 (sum of the two random numbers)
    4. The first data table (giving the outputs in D5:D9).
    5. Cell H4 (average of D5:D9)
    6. The second data table (giving the outputs in H5:H9).

    But note that (for example) steps 1 and 2 could be switched (i.e. calculated in the other order by Excel), because they don't depend on one another. In fact, it turns out that step 6 also does not depend on the previous steps (counterintuitively).

    At step 4, the first data table is being calculated correctly. This essentially means that the formula in D4 is evaluated five times. Since D4 does not depend on the input cell for the data table (C4), its inputs will not change. However, the five outputs are different because the formula uses the RAND() function, which is volatile, so the formula returns different values for each of the five output cell evaluations.

    At step 6, the second data table will evaluate the formula in its own input cell, H4. The formula in H4 is =AVERAGE(D5:D9). But by this point the first data table has already been evaluated. From the perspective of the evaluation formula, =AVERAGE(D5:D9), the values in D5:D9 are fixed. This means that the five evaluations of the formula will all return the same value.

    This is Reason #1 above: Excel has no reason to calculate the first data table multiple times to provide different sets of inputs for the second data table. Only one set of inputs is provided to the second data table, which means the AVERAGE() calculation returns the same value five times.

    Data tables and volatility

    To understand Reason #2, it is important to distinguish between calculating H4, which returns a result to cell H4, and repeatedly evaluating the data table formula (which happens to come from H4), which yields the second data table's outputs. The data table outputs themselves have no dependency on the value in cell H4 - merely on the formula in that cell.

    The reason the results from the second table are not the same as the value in H4 itself is because of the way Excel only recalculates what it thinks it needs to. If you are in automatic calculation mode, a formula including a volatile function such as RAND() will recalculate every time you change any cell in the workbook. Any cells depending on that volatile formula will also recalculate, and so on.

    But data tables don't work quite like this. Their volatility (or otherwise) depends only on the formula in the evaluation formula cell (not its value).

    In the first data table, the formula cell D4 is effectively volatile because it depends on the volatile formulas in B2 and B3. However, in the second data table, the formula cell H4 depends on cells D5:D9 which are the outputs from the first data table, and Excel doesn't consider those cells to be volatile even though the first data table's evaluation formula is volatile.

    What this means in practice is that Excel thinks it can calculate the second data table before evaluating H4 or even D5:D9. This is causing the values in your second table to be "one calculation behind": they depend on the values in cells D5:D9 at the start of the overall calculation (i.e. the end of the previous calculation), not the values in D5:D9 as evaluated during the current calculation.

    To see this, try the following:

    1. First, try repeatedly pressing F9. This will recalculate the full sheet. Note the value in H4 after each calculation. You will see that each time you press F9, the outputs from your second data table will change to the value that was in cell H4 before you pressed F9. Cell H4 itself will also be recalculated and will display a new value based on the new values in cells D5:D9.

    2. Next, try repeatedly pressing Ctrl+Alt+F9. This will force the whole worksheet to be recalculated from scratch. Now you will see that the values in the second data table do match the value in cell H4. But if you then press F9 once on its own, only H4 will change - the second data table's outputs will not.

    This illustrates Reason #2 above and explains why (in your screenshot) the second data table's outputs do not show the same value as H4.

    Alternatives

    Hopefully the above answers your question of why your workbook is not working as you want it to.

    If you want to evaluate an 'average of n averages', you will need to calculate the equivalent of n instances of the first data table. This is not straightforward to achieve in native Excel (without VBA) without cloning your first data table many times, which presumably you want to avoid.

    Put another way, your second data table's evaluation formula depends on five values (D5:D9), the five outputs from the first data table. If you want to calculate the average of n runs, you will need 5n results (not just 5).

    Excel does support some types of iterative calculation. There are articles online explaining how to do this but it might be beyond the scope of what you're trying to do.

    Finally: it is possible to achieve what you want very easily (without VBA) using an add-in such as Schematiq. (Disclaimer: I work for Schematiq.) This is illustrated in the screenshot below. I recognise that using an add-in is unlikely to be acceptable for your purposes but I include this in case it helps other people trying to solve the same problem.

    enter image description here

    In summary:

    1. Your second Monte Carlo simulation is receiving only one set of inputs, so all the outputs are the same.
    2. Your second data table isn't volatile and so is "one calculation behind" the result in H4.
    3. Unfortunately I don't believe what you're trying to do is possible to achieve directly using this approach.