Search code examples
probabilitymontecarlo

Calculate certainty of Monte Carlo simulation


Let's say that we use the Monte Carlo method to estimate the area of an object, in the exact same way you'd use it to estimate the value of π.

Now, let's say we want to calculate the certainty of our simulation result. We've cast n samples, m of which landed inside the object, so the area of the object is approximately m/n of the total sampled area. We would like to make a statement such as:

"We are 99% certain that the area of the object is between a1 and a2."

How can we calculate a1 and a2 above (given n, m, total area, and the desired certainty)?

Here is a program which attempts to estimate this bound numerically. Here the samples are points in [0,1), and the object is the segment [0.25,0.75). It prints a1 and a2 for 50%, 90%, and 99%, for a range of sample counts:

import std.algorithm;
import std.random;
import std.range;
import std.stdio;

void main()
{
    foreach (numSamples; iota(0, 1000+1, 100).filter!(n => n > 0))
    {
        auto samples = new double[numSamples];
        enum objectStart = 0.25;
        enum objectEnd   = 0.75;

        enum numTotalSamples = 10_000_000;
        auto numSizes = numTotalSamples / numSamples;
        auto sizes = new double[numSizes];
        foreach (ref size; sizes)
        {
            size_t numHits;
            foreach (i; 0 .. numSamples)
            {
                auto sample = uniform01!double;
                if (sample >= objectStart && sample < objectEnd)
                    numHits++;
            }

            size = 1.0 / numSamples * numHits;
        }

        sizes.sort;
        writef("%d samples:", numSamples);
        foreach (certainty; [50, 90, 99])
        {
            auto centerDist = numSizes * certainty / 100 / 2;
            auto startPos = numSizes / 2 - centerDist;
            auto endPos   = numSizes / 2 + centerDist;
            writef("\t%.5f..%.5f", sizes[startPos], sizes[endPos]);
        }
        writeln;
    }
}

(Run it online.) It outputs:

//                     50%                 90%                 99%
100 samples:    0.47000..0.53000    0.42000..0.58000    0.37000..0.63000
200 samples:    0.47500..0.52500    0.44500..0.56000    0.41000..0.59000
300 samples:    0.48000..0.52000    0.45333..0.54667    0.42667..0.57333
400 samples:    0.48250..0.51750    0.46000..0.54250    0.43500..0.56500
500 samples:    0.48600..0.51600    0.46400..0.53800    0.44200..0.55800
600 samples:    0.48667..0.51333    0.46667..0.53333    0.44833..0.55167
700 samples:    0.48714..0.51286    0.46857..0.53143    0.45000..0.54857
800 samples:    0.48750..0.51250    0.47125..0.53000    0.45375..0.54625
900 samples:    0.48889..0.51111    0.47222..0.52667    0.45778..0.54111
1000 samples:   0.48900..0.51000    0.47400..0.52500    0.45800..0.53900

Is it possible to precisely calculate these numbers instead?

(Context: I'd like to add something like "±X.Y GB with 99% certainty" to btdu)


Solution

  • Ok, with question being language agnostic, here is the illustration how to do error estimation with Monte-Carlo.

    Suppose, you want to compute integral

    I = S01 f(x) dx

    where f(x) is simple polynomial function

    f(x) = xn

    Here is the illustration of the calculations.

    For that you have to compute not only mean value, but standard deviation as well.

    Then, knowing that Monte Carlo error is going down as inverse square root of number of samples, computing confidence interval is simple

    Code, Python 3.7, Windows 10 x64

    import numpy as np
    
    rng = np.random.default_rng()
    
    N = 100000
    
    n = 2
    
    def f(x):
        return np.power(x, n)
    
    sample = f(rng.random(N)) # N samples of the function
    
    m = np.mean(sample)        # mean value of the sample, approaching integral value as N->∞
    s = np.std(sample, ddof=1) # standard deviation with Bessel correction
    
    e = s / np.sqrt(N) # Monte Carlo error decreases as inverse square root
    
    t = 2.576 # For 99% confidence interval, we should take 2.58 sigma, per Gaussian distribution
    #t = 3.00 # For 99.7% confidence interval, we should take 3 sigma, per Gaussian distribution
    
    print(f'True integral value is {1.0/(1.0+n)}')
    print(f'Computed integral value is in the range [{m-t*e}...{m+t*e}] with 99% confidence')
    

    will print something like

    True integral value is 0.3333333333333333
    Computed integral value is in the range 
    [0.33141772204489295...0.3362795491124624] with 99% confidence
    

    You could use Z-score table, line this one along the lines, to print table you want. You could vary N to get desired N dependency

    zscore = {'50%': 0.674, '80%': 1.282, '90%': 1.645, '95%': 1.960, '98%': 2.326, '99%': 2.576, '99.7%': 3.0}
    for c, z in zscore.items():
        print(f'Computed integral value is in the range [{m-z*e}...{m+z*e}] with {c} confidence')