I'm working to optimize access performance for large 2D contiguous arrays in C++ where I expect to be performing structured accesses. However, there seems to be a lot of mixed information across StackOverflow (and elsewhere on the web) on the most efficient way of actually tackling this problem.
I used the following code to test the runtime of four different approaches. Compilation was performed via g++ -std=c++20 -O3 vectortest.cpp -o vectortest, where g++ is clang version 14.0.3.
#include <vector>
#include <chrono>
#include <iostream>
using namespace std::chrono;
#define ARRAYSIZE 25000
template <typename T>
class vector2d {
public:
    size_t size_x, size_y;
    std::vector<T> mvec;

    vector2d(size_t a_size_x, size_t a_size_y) :
        size_x(a_size_x),
        size_y(a_size_y)
    {
        mvec.resize(size_x * size_y);
    }

    T & operator()(size_t i, size_t j) {
        return mvec[i * size_y + j]; // row-major: the row stride is the column count, size_y
    }
};
template <typename T>
class array2d {
public:
    size_t size_x, size_y;
    T * mvec;

    array2d(size_t a_size_x, size_t a_size_y) :
        size_x(a_size_x),
        size_y(a_size_y)
    {
        mvec = new T[size_x * size_y];
    }
    ~array2d() {
        delete[] mvec;
    }
    // non-copyable: copying the raw owning pointer would lead to a double free
    array2d(const array2d &) = delete;
    array2d & operator=(const array2d &) = delete;

    T & operator()(size_t i, size_t j) {
        return mvec[i * size_y + j]; // row-major: the row stride is the column count, size_y
    }
};
double vectorvector() {
    std::vector< std::vector<double> > x(ARRAYSIZE, std::vector<double>(ARRAYSIZE));
    for (size_t i = 0; i < ARRAYSIZE; i++) {
        for (size_t j = 0; j < ARRAYSIZE; j++) {
            x[i][j] = static_cast<double>(i) * static_cast<double>(j);
        }
    }
    return x[ARRAYSIZE/2][ARRAYSIZE/2];
}
double vector_contiguous() {
    std::vector<double> x(ARRAYSIZE*ARRAYSIZE);
    for (size_t i = 0; i < ARRAYSIZE; i++) {
        for (size_t j = 0; j < ARRAYSIZE; j++) {
            x[i * ARRAYSIZE + j] = static_cast<double>(i) * static_cast<double>(j);
        }
    }
    return x[(ARRAYSIZE/2)*ARRAYSIZE + (ARRAYSIZE/2)];
}
double vector2d_class() {
    vector2d<double> x(ARRAYSIZE, ARRAYSIZE);
    for (size_t i = 0; i < ARRAYSIZE; i++) {
        for (size_t j = 0; j < ARRAYSIZE; j++) {
            x(i,j) = static_cast<double>(i) * static_cast<double>(j);
        }
    }
    return x(ARRAYSIZE/2,ARRAYSIZE/2);
}
double array2d_class() {
    array2d<double> x(ARRAYSIZE, ARRAYSIZE);
    for (size_t i = 0; i < ARRAYSIZE; i++) {
        for (size_t j = 0; j < ARRAYSIZE; j++) {
            x(i,j) = static_cast<double>(i) * static_cast<double>(j);
        }
    }
    return x(ARRAYSIZE/2,ARRAYSIZE/2);
}
int main() {
    std::chrono::time_point<std::chrono::high_resolution_clock> start, stop;
    microseconds duration;

    start = high_resolution_clock::now();
    std::cout << vectorvector() << std::endl;
    stop = high_resolution_clock::now();
    duration = duration_cast<microseconds>(stop - start);
    std::cout << "vectorvector: " << duration.count() << " microseconds" << std::endl;

    start = high_resolution_clock::now();
    std::cout << vector_contiguous() << std::endl;
    stop = high_resolution_clock::now();
    duration = duration_cast<microseconds>(stop - start);
    std::cout << "vector_contiguous: " << duration.count() << " microseconds" << std::endl;

    start = high_resolution_clock::now();
    std::cout << vector2d_class() << std::endl;
    stop = high_resolution_clock::now();
    duration = duration_cast<microseconds>(stop - start);
    std::cout << "vector2d_class: " << duration.count() << " microseconds" << std::endl;

    start = high_resolution_clock::now();
    std::cout << array2d_class() << std::endl;
    stop = high_resolution_clock::now();
    duration = duration_cast<microseconds>(stop - start);
    std::cout << "array2d_class: " << duration.count() << " microseconds" << std::endl;
}
Resulting timings were as follows:
vectorvector: 3291998 microseconds
vector_contiguous: 2238134 microseconds
vector2d_class: 2221989 microseconds
array2d_class: 1544225 microseconds
The vectorvector solution is a common one online, and I've seen several claims on StackOverflow that with modern compilers it is likely to be just as performant as a contiguous 1D array. That does not seem to be the case here: vectorvector is about 47% slower than the contiguous vector solution.
But perhaps most surprising to me is that the array2d option is around 30% faster than the analogous vector-based solutions (equivalently, the vector versions take roughly 44% longer), even with -O3. Following "Vectors and Arrays in C++" I also tried -finline-functions and found roughly the same numbers.
Anyway, my questions are: Are these results consistent with expectations? Are there any obvious flaws in this experiment? Are there alternative solutions I should pursue? And why does the STL not have a multidimensional vector type?
As others have explained in the comments, contiguous 1D arrays/memory are indeed much faster today than vector<vector<T>>, sometimes by orders of magnitude. I'm actually surprised that vectorvector is only ~47% slower than the contiguous vector here. I assume that, since you initialize the containers inside the same function that accesses them, most of the data is already in cache, making it faster than expected. If you actually want to benchmark the case where a container is filled immediately after creation, that's fine; but for the many situations where it is accessed later, it would be better to separate those operations and make sure the cache is invalidated between the timed loops.
But perhaps most surprising to me is that the array2d option is around 30% faster than the analogous vector-based solutions, even with -O3. Following "Vectors and Arrays in C++" I also tried -finline-functions and found roughly the same numbers.
This is almost certainly due to the fact that vector<> value-initializes its contents when resized, as mentioned in the comments. This applies to 1D vectors too, and also to the constructor that takes a size. In some scenarios the compiler might be able to optimize this away, but I think your case is too complex for it to see that you effectively overwrite the whole array (or it has other reasons to keep the initialization). Looking at the assembly from clang (-O3) on godbolt:
movabs r14, 5000000000
mov rdi, r14
call operator new(unsigned long)@PLT
mov rbx, rax
xor r15d, r15d
mov rdi, rax
xor esi, esi
mov rdx, r14
call memset@PLT
This is the first thing executed in your vector_contiguous case: before the loop even starts, memset zero-initializes the entire 5000000000-byte allocation (25000 × 25000 doubles at 8 bytes each). That explains the difference from a raw array you allocate yourself, which is left uninitialized.
This is, in my opinion, a pretty severe limitation of vector, at least in that you have no option to prevent it. std::string recently gained resize_and_overwrite (https://en.cppreference.com/w/cpp/string/basic_string/resize_and_overwrite), which has a somewhat clumsy syntax IMHO, but achieves exactly that: you skip the overhead of zero-initializing the buffer in cases where you know you will overwrite it anyway. This gave some real speedups in code of mine that temporarily used a std::string as a buffer for file loading.
As an aside, I've used my own "vector" implementation for quite a while. I went the more direct route and added a constructor overload that allows the memory to be left uninitialized:
sys::Vector<double> vectorContiguous(VECTOR_SIZE * VECTOR_SIZE, sys::UninitializedHint); // behaves like "new double[]": allocates the memory without initializing it
Of course, this only compiles for trivial types. But it likewise gives noticeable speedups when using a vector the way you propose. There are reasons why std::string went with the resize_and_overwrite approach, explained in the paper, so I'm not necessarily saying vector must adopt what I proposed. But something along those lines for std::vector would be useful IMHO.