Unable to see gains on using boost::multi_array as compared to raw C array of pointers or C++ vector<vector<double>>

I am running a simple benchmarking exercise of 2d array access of reading and writing.

I have const int XDim = 1000; and const int YDim = 1000; and create a 1000 x 1000 matrix.

In C-style, I initialize using

double* doubleArray[XDim];
for (int i = 0; i < XDim; i++)
    doubleArray[i] = (double*)malloc(YDim * sizeof(double));
for (int i = 0; i < XDim; i++)
    for (int j = 0; j < YDim; j++)
        doubleArray[i][j] = 0.;

In C++ std::vector<std::vector<double>> I use:

std::vector<std::vector<double>> doubleArray;
std::vector<double> vecdbl;
for (int i = 0; i < XDim; i++)
    doubleArray.push_back(vecdbl);
for (int i = 0; i < XDim; i++)
    for (int j = 0; j < YDim; j++)
        doubleArray[i].push_back(0.);

In boost style, I use:

typedef boost::multi_array<double, 2> OuterArray;
typedef boost::array<OuterArray::index, 2> OuterArrayIndices;
OuterArray::index rows = XDim;
OuterArray::index cols = YDim;
OuterArrayIndices shape = { rows, cols };
OuterArray doubleArray(shape);
for (int i = 0; i < XDim; i++)
    for (int j = 0; j < YDim; j++)
        doubleArray[i][j] = 0.;

Then, I randomly pick 999999 elements from doubleArray[][] and set them to a random integer value chosen in [-5, 5]. Then, I further pick randomly 999999 elements, add them up, then find their avarage. The average is found and the time taken for all of this is displayed. This exercise is repeated many times.

The Godbolt link that does all this is available over here

On the above link, Godbolt seems to only handle const int total_iterations = 2;. Larger values give an error that the time has exceeded. On running this on my computer with const int total_iterations = 100;, a consistent pattern I observe is that the boost method is unable to match the C method of malloc or the C++ method of std::vector<std::vector<double>>. Between the latter two, mostly, the C method beats the C++ method while there are a handful of instances of the reverse. All of this is compiled and run in Visual Studio in Release mode using default settings.

I am unable to see the value of boost::multi_array given this over raw C pointer of pointer or std::vector<std::vector<double>>. Is there a way boost::multi_array can be made faster?

Edited to add: Based on comments, with sequential access to array elements, and release mode settings on Godbolt, boost::multi_array wins majority of the times. See here 9 out of 10 times.

With nonsequential, random access to elements, and release mode settings on Godbolt, it is close, with C style slightly better than boost::multi_array. See here

Thanks to tstanisl in the comments, for sequential access, C style seems to be faster in 10 out of 10 times. His link is accessible here

There is an overall caveat pointed out by user n-m-could-be-an-ai that this profiling could be profiling the random number generator. Controlling for that seems to be beyond the purview of this question. However, it appears that regardless, since the same number of RNGs are generated for each case, and since the number of iterations is also not 1 or 2, but larger, that could be a wash. I could be wrong.

Edited to add(2): Based on answer provided by chqrlie, I have modified the code which is available here. I have commented it hopefully sufficiently to see what is going on. It is still not fully clear to me which way is faster, but perhaps someone else can use this code as a starting point to better benchmark these various data structures.

Solution

I'm afraid your benchmark has serious flaws that make the results almost meaningless:

you compile the C++ code without optimisations. Add -O2 or -O3 in the command option field of the compilation window after uncovering it.
you use Godbolt's infrastructure to run the benchmark: the CPUs in the cloud are wildly overloaded, which may explain the inconsistent results between runs and even from one method to another.
it is unclear how efficient the random number generators are compared to the matrix access times: you should use a separate array to pre-compute these values or at least measure the time without array access, summing all such computations to avoid optimisation side effects.

Here is a modified version with an additional matrix implementation using C99 VLAs, which are supported by gcc: https://godbolt.org/z/v5o8P5M6d

Try it on a standalone idle computer to get meaningful results.