
Why is Matlab 11 times faster than C++


I am comparing the speed of a Monte Carlo pricing algorithm for a Vanilla call option between Matlab and C++. This is not the same as Why is MATLAB so fast in matrix multiplication? since the speed-up is not due to matrix multiplication (there is only a dot product which is done quickly) but seems to be due to its highly efficient Gaussian random number generator.

In Matlab the code has been vectorised and looks as follows:

function [ value ] = OptionMCValue( yearsToExpiry, spot, strike, riskFreeRate, dividendYield, volatility, numPaths  )

    sd = volatility*sqrt(yearsToExpiry);
    sAdjusted = spot * exp( (riskFreeRate - dividendYield - 0.5*volatility*volatility) * yearsToExpiry);

    g = randn(1,numPaths);
    sT = sAdjusted * exp( g * sd );
    values = max(sT-strike,0);
    value = mean(values);
    value = value * exp(-riskFreeRate * yearsToExpiry);

end

If I run this with 10 million paths as follows

strike = 100.0;
yearsToExpiry = 2.16563;
spot = 100.0;
volatility = 0.20;
dividendYield = 0.03;
riskFreeRate = 0.05;
oneMillion = 1000000;
numPaths = 10*oneMillion;

tic
value = OptionMCValue( yearsToExpiry, spot, strike, riskFreeRate, dividendYield, volatility, numPaths  );
toc

I get

Elapsed time is 0.359304 seconds.
   12.8311
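As a sanity check on that number: for a vanilla call the Monte Carlo estimate can be compared against the closed-form Black-Scholes price. A minimal C++ sketch (the function names here are mine, not from the code below):

```cpp
#include <cmath>

// Standard normal CDF via the complementary error function.
double normCdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Closed-form Black-Scholes price of a vanilla call with a continuous
// dividend yield, using the same parameter order as the Matlab function.
double blackScholesCall(double yearsToExpiry, double spot, double strike,
                        double riskFreeRate, double dividendYield, double volatility) {
    double sd = volatility * std::sqrt(yearsToExpiry);
    double d1 = (std::log(spot / strike)
                 + (riskFreeRate - dividendYield + 0.5 * volatility * volatility) * yearsToExpiry) / sd;
    double d2 = d1 - sd;
    return spot * std::exp(-dividendYield * yearsToExpiry) * normCdf(d1)
         - strike * std::exp(-riskFreeRate * yearsToExpiry) * normCdf(d2);
}
```

With the parameters above this gives approximately 12.83, so the Monte Carlo estimate agrees with the analytic price to within sampling error.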

Now I do the same thing in C++ in VS2013

My code is in an OptionMC class and is as follows

double OptionMC::value(double yearsToExpiry, 
                   double spot,
                   double riskFreeRate,
                   double dividendYield,
                   double volatility, 
                   unsigned long numPaths )
{
    double sd = volatility*sqrt(yearsToExpiry);
    double sAdjusted = spot * exp( (riskFreeRate - dividendYield - 0.5*volatility*volatility) * yearsToExpiry);
    double value = 0.0;
    double g, sT;

    for (unsigned long i = 0; i < numPaths; i++)
    {
        g = GaussianRVByBoxMuller();
        sT = sAdjusted * exp(g * sd);
        value += Max(sT - m_strike, 0.0);
    }

    value = value * exp(-riskFreeRate * yearsToExpiry);
    value /= (double) numPaths;
    return value;
}

The BM code is as follows

double GaussianRVByBoxMuller()
{
    double x, y, w;

    do
    {
        x = 2.0*rand() / static_cast<double>(RAND_MAX) - 1.0;
        y = 2.0*rand() / static_cast<double>(RAND_MAX) - 1.0;
        w = x*x + y*y;
    } while (w >= 1.0 || w == 0.0); // also reject w == 0 to avoid log(0)

    w = sqrt(-2.0 * log(w) / w);
    return x*w;
}

I have set the Optimization option to Optimize for speed in Visual Studio.

For 10m paths it takes 4.124 seconds.

This is 11 times slower than Matlab.

Can anyone explain the difference ?

EDIT: On further testing, the slowdown does seem to be the call to GaussianRVByBoxMuller. Matlab seems to have a very efficient implementation - the Ziggurat method. Note that BM is sub-optimal here as it generates two RVs per call and I only use one; fixing just that would give a factor-of-2 speed-up.
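To illustrate that factor-of-2 fix: cache the second deviate instead of discarding it, so the rejection loop only runs on every other call. A sketch (the cached-spare bookkeeping is mine; like rand() itself, the static state makes this single-threaded only):

```cpp
#include <cmath>
#include <cstdlib>

// Polar Box-Muller that keeps the second deviate for the next call
// instead of throwing it away.
double GaussianRVByBoxMullerCached()
{
    static bool haveSpare = false;
    static double spare;

    if (haveSpare)
    {
        haveSpare = false;
        return spare;
    }

    double x, y, w;
    do
    {
        x = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1.0;
        y = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1.0;
        w = x * x + y * y;
    } while (w >= 1.0 || w == 0.0);

    w = std::sqrt(-2.0 * std::log(w) / w);
    spare = y * w;      // save the second deviate for the next call
    haveSpare = true;
    return x * w;
}
```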


Solution

  • As it stands right now, you're generating single-threaded code. At a guess, Matlab is using multi-threaded code. That allows it to run faster by a factor of approximately N, where N = the number of cores in your CPU.

    There's a little more to the story than that though. One more problem that arises is that you're using rand(), which uses hidden, global state. As such, if you do a simple rewrite of your code to make it multi-threaded, chances are pretty good that conflicts over rand()'s internal state will prevent you from getting much speed improvement (and might easily run slower--perhaps quite a bit slower).

    To cure that, you might consider (for example) using the new random number generation (and possibly distribution) classes added in C++11. With these, you can create a separate instance of the random number generator for each thread, preventing conflict over their internal state.

    I rewrote your code a little bit to use those, and call the function, to get this:

    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <random>
    
    double m_strike = 100.0;
    
    class generator {
        std::mt19937_64 gen;
        std::normal_distribution<double> dis;
    public:
        generator(double mean = 0.0, double stddev = 1.0)
            : gen(std::random_device()()), dis(mean, stddev) {}
    
        double operator()() {
            return dis(gen);
        }
    };
    
    double value(double yearsToExpiry,
        double spot,
        double riskFreeRate,
        double dividendYield,
        double volatility,
        unsigned long numPaths)
    {
        double sd = volatility*sqrt(yearsToExpiry);
        double sAdjusted = spot * exp((riskFreeRate - dividendYield - 0.5*volatility*volatility) * yearsToExpiry);
        double value = 0.0;
    
        generator gen;
    
    // run iterations in parallel, with a private random number generator
    // (and loop-local temporaries, to avoid sharing them across threads) for each thread:
    #pragma omp parallel for reduction(+:value) private(gen)
        for (long i = 0; i < (long)numPaths; i++)
        {
            double g = gen(); // GaussianRVByBoxMuller();
            double sT = sAdjusted * exp(g * sd);
            value += std::max(sT - m_strike, 0.0);
        }
    
        value = value * exp(-riskFreeRate * yearsToExpiry);
        value /= (double)numPaths;
        return value;
    }
    
    int main() {
        std::cout << "value: " << value(2.16563, 100.0, 0.05, 0.03, 0.2, 10'000'000) << "\n";
    }
    
    

    I compiled this with VC++ 2015, using the following command line:

    cl -openmp -Qpar -arch:AVX -O2b2 -GL test.cpp
    

    On an AMD A8-7600 this ran in ~.31 seconds.
    On an Intel i7 processor, this ran in ~.16 seconds.

    Of course, if you have a CPU with a lot more cores, you stand a decent chance of this running quite a bit faster still.

    As it stands right now, my code requires VC++ 2015 instead of 2013, but I doubt that it affects performance much. It's mostly just convenience, like using 10'000'000 instead of 10000000 (but I'm not going to install a copy of 2013 on this machine just to figure out exactly what I need to change to accommodate it).

    Also note that on a recent Intel processor, you might (or might not) gain some improvement by changing arch:AVX to arch:AVX2 instead.

    A quick check of single-threaded code indicates that your Box-Muller distribution code may be a tad faster than the standard library's normal distribution code, so switching to a thread-friendly version of that might gain a little more speed (and the optimized version should approximately double that as well).
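    If you want to experiment with that, a thread-friendly version just wraps the Box-Muller state (the engine plus the cached spare deviate) in a class with one instance per thread, mirroring the `generator` class above. A sketch under those assumptions (the class name is mine):

```cpp
#include <cmath>
#include <random>

// Box-Muller generator with per-instance state: give each thread its own
// instance (e.g. via OpenMP private) and there is no shared state to fight over.
class BoxMullerGenerator {
    std::mt19937_64 gen;
    std::uniform_real_distribution<double> uni{-1.0, 1.0};
    double spare = 0.0;
    bool haveSpare = false;
public:
    BoxMullerGenerator() : gen(std::random_device{}()) {}

    double operator()() {
        if (haveSpare) {
            haveSpare = false;
            return spare;
        }
        double x, y, w;
        do {
            x = uni(gen);
            y = uni(gen);
            w = x * x + y * y;
        } while (w >= 1.0 || w == 0.0);
        w = std::sqrt(-2.0 * std::log(w) / w);
        spare = y * w;   // keep the second deviate for the next call
        haveSpare = true;
        return x * w;
    }
};
```

    This slots into the parallel loop above in place of `generator gen;`, keeping the same `private(gen)` clause.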