Search code examples
crandomintegrationmontecarlo

Monte Carlo integration of the Gaussian function f(x) = exp(-x^2/2) in C incorrect output


I'm writing a short program to approximate the definite integral of the gaussian function f(x) = exp(-x^2/2), and my codes are as follows:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

double gaussian(double x) {
    return exp((-pow(x,2))/2);
}

int main(void) {
    srand(0);
    double valIntegral, yReal = 0, xRand, yRand, yBound;
    int xMin, xMax, numTrials, countY = 0;

    do {
        printf("Please enter the number of trials (n): ");
        scanf("%d", &numTrials);
        if (numTrials < 1) {
            printf("Exiting.\n");
            return 0;
        }  
        printf("Enter the interval of integration (a b): ");
        scanf("%d %d", &xMin, &xMax);      
        while (xMin > xMax) { //keeps looping until a valid interval is entered
            printf("Invalid interval!\n");
            printf("Enter the interval of integration (a b): ");
            scanf("%d %d", &xMin, &xMax);
        }
        //check real y upper bound
        if (gaussian((double)xMax) > gaussian((double)xMin))
            yBound = gaussian((double)xMax);
        else 
            yBound = gaussian((double)xMin);
        for (int i = 0; i < numTrials; i++) {
            xRand = (rand()% ((xMax-xMin)*1000 + 1))/1000.00 + xMin; //generate random x value between xMin and xMax to 3 decimal places             
            yRand = (rand()% (int)(yBound*1000 + 1))/1000.00; //generate random y value between 0 and yBound to 3 decimal places
            yReal = gaussian(xRand);
            if (yRand < yReal) 
                countY++;
        }
        valIntegral = (xMax-xMin)*((double)countY/numTrials);
        printf("Integral of exp(-x^2/2) on [%.3lf, %.3lf] with n = %d trials is: %.3lf\n\n", (double)xMin, (double)xMax, numTrials, valIntegral);

        countY = 0; //reset countY to 0 for the next run
    } while (numTrials >= 1);

    return 0;
}

However, the outputs from my code doesn't match the solutions. I tried to debug and print out all xRand, yRand and yReal values for 100 trials (and checked yReal value with particular xRand values with Matlab, in case I had any typos), and those values didn't seem to be out of range in any way... I don't know where my mistake is.

The correct output for # of trials = 100 on [0, 1] is 0.810, and mine is 0.880; correct output for # of trials = 50 on [-1, 0] is 0.900, and mine was 0.940. Can anyone find where I did wrong? Thanks a lot.

Another question is, I can't find a reference to the use of following code:

double randomNumber = rand() / (double) RAND MAX;

but it was provided by the instructor and he said it would generate a random number from 0 to 1. Why did he use '/' instead of '%' after "rand()"?


Solution

  • There's a few logical errors / discussion points in your code, both mathematics and programming-wise.

    First of all, just to get it out of the way, we're talking about the standard gaussian here, i.e.

    [latex] $$ \mathcal {N} (x ~|~ \mu = 0, \sigma^2 = 1) ~=~ \frac {1} {\sqrt {2 \pi}} \exp \Biggl \{ - \frac {x^2} {2} \Biggr \} $$ [/latex]

    except, the definition of the gaussian on line 6, omits the [latex] $$ \sqrt{2\pi} $$ [/latex] normalising term. Given the outputs you seem to expect, this seems to have been done on purpose. Fair enough. But if you wanted to calculate the actual integral, such that a practically infinite range (e.g. [-1000, 1000]) would sum up to 1, then you would need that term.


    Is my code logically correct?

    No. Your code has two logical errors: one on line 29 (i.e. your if statement), and one on line 40 (i.e. the calculation of valIntegral), which is a direct consequence of the first logical error.

    For the first error, consider the following plot to see why:

    Your Monte Carlo process effectively considers a bounded box over a certain range, and then says "I will randomly place points inside this box, and then count the proportion of the total number of points that randomly fell under the curve; the integral estimate is then the area of the bounded box itself, times this proportion".

    Now, if both [latex] $$ x \! _ { _ { m \! i \! n }} $$ [/latex] and [latex] $$ x \! _ { _ { m \! a \! x }} $$ [/latex] are to the left of the mean (i.e. 0), then your if statement correctly sets the box's upper bound (i.e. yBound) to [latex] $$ \mathcal {N} ( x \! _ { _ {m \! a \! x}} ) $$ [/latex] such that the topmost bound of the box contains the highest part of that curve. So, e.g., to estimate the integral for the range [-2,-1], you set the upper bound to [latex] $$ \mathcal {N} ( - 1 ) $$ [/latex] .

    Similarly, if both [latex] $$ x \! _ { _ { m \! i \! n }} $$ [/latex] and [latex] $$ x \! _ { _ { m \! a \! x }} $$ [/latex] are to the right of the mean, then you correctly set yBound to [latex] $$ \mathcal {N} ( x \! _ { _ {m \! i \! n}} ) $$ [/latex]

    However, if [latex] $$ x ! _ { _ {m ! i ! n}} < 0 < x ! _ { _ {m ! a ! x}} $$ [/latex] , you should be setting yBound to neither [latex] $$ x \! _ { _ { m \! i \! n }} $$ [/latex] nor [latex] $$ x \! _ { _ { m \! a \! x }} $$ [/latex] , since the 0 point is higher than both!. So in this case, your yBound should simply be at the peak of the Gaussian, i.e. [latex] $$ \mathcal {N} ( 0 ) $$ [/latex] (which in your case of an unnormalised Gaussian, this takes a value of '1').

    Therefore, the correct if statement is as follows:

    if (xMax < 0.0)
      { yBound = gaussian((double)xMax); }
    else if (xMin > 0.0)
      { yBound = gaussian((double)xMin); }
    else
      { yBound = gaussian(0.0); }
    

    As for the second logical error, we already mentioned that the value of the integral is the "area of the bounding box" times the "proportion of successes". However, you seem to ignore the height of the box in your calculation. It is true that in the special case where [latex] $$ x ! _ { _ {m ! i ! n}} < 0 < x ! _ { _ {m ! a ! x}} $$ [/latex] , the height of your unnormalised Gaussian function defaults to '1', therefore this term can be omitted. I suspect that this is why it may have been missed. However, in the other two cases, the height of the bounding box is necessarily less than 1, and therefore needs to be included in the calculation. So the correct code for line 40 should be:

    valIntegral = yBound * (xMax-xMin) * (((double)countY)/numTrials);
    

    Why am I not getting the correct output?

    Even despite the above logical errors, as we've discussed above, your output should have been correct for the specific intervals [0,1] and [-1,0] (since they include the mean and therefore the correct yBound of 1). So why are you still getting a 'wrong' output?

    The answer is, you are not. Your output is "correct". Except, a Monte Carlo process involves randomness, and 100 trials is not a big enough number to lead to consistent results. If you run the same range for 100 trials again and again, you'll see you'll get very different results each time (though, overall, they'll be distributed around the right value). Run with 1000000 trials, and you'll see that the result becomes a lot more precise.


    What's up with that randomNumber code?

    The rand() function returns an integer in the range [0, RAND_MAX], where RAND_MAX is system-specific (have a look at man 3 rand).

    The modulo approach (i.e. %) works as follows: consider the range [-0.1, 0.3]. This range spans 0.4 units. 0.4 * 1000 + 1 = 401. For a random number from 0 to RAND_MAX, doing rand() modulo 401 will always result in a random number in the range [0,400]. If you then divide this back by 1000, you get a random number in the range [0, 0.4]. Add this to your xmin offset (here: -0.1) and you get a random number in the range [-0.1, 0.3].

    In theory, this makes sense. However, unfortunately, as already pointed out in the other answer here, as a method it is susceptible to modulo bias, because RAND_MAX isn't necessarily exactly divisible by 401, therefore the top part of that range leading up to RAND_MAX overrepresents some numbers compared to others.

    By contrast, the approach given to you by your teacher is simply saying: divide the result of the rand() function with RAND_MAX. This effectively normalises the returned random number into the range [0,1]. This is a much more straightforward thing to do, and it avoids modulo bias.

    Therefore, the way I would implement this would be to make it into a function:

    double randomNumber(void) {
      return rand() / (double) RAND_MAX;
    }
    

    which then simplifies your computations as follows too:

    xRand = randomNumber() * (xMax-xMin) + xMin;
    yRand = randomNumber() * yBound;
    

    You can see that this is a much more accurate thing to do, if you use a normalised gaussian, i.e.

    double gaussian(double x) {
      return exp((-pow(x,2.0))/2.0) / sqrt(2.0 * M_PI);
    }
    

    and then compare the two methods. You will see that the randomNumber() method for an "effectively infinite" range (e.g. [-1000,1000]) gives the correct result of 1, whereas the modulo approach tends to give numbers that are larger than 1.