Initialising two large 2D array of 401X401 and do fast matrix multiplication in C++

I want to initialise two 2D matrices of size 401X401 and multiply them in a speedy way. But most probably due to stack overflow, two double 2D matrices were not initialised as stated in this question: No output after new arrays being initialized in C++.

After following suggestions, I used vectors of vectors to store my 2D matrix. But I wanted to do fast matrix multiplication as time is a significant factor for me. There were suggestions not to use vectors of vectors for this purpose: How can we speedup matrix multiplication where matrices are initialized using vectors of vectors (2D vector) in C++.

Maybe I can convert again to array, but I feel again there would be StackOverflow!! How can I do both the task of initialising two large matrices and matrix multiplication is also fast? If I stick to vectors of vectors, there is no way to use abilities of inbuilt libraries such as MKL etc.

Solution

This can be sped up considerably.

First, to eliminate the stack usage issue declare the matrix like so:

std::vector<std::array<double,401>> matrix(401);

This achieves cache coherence since all the matrix memory is contiguous.

Next, transpose one of the matrixes. This will make the multiplication proceed in sequential memory accesses.

And finally, this lends itself to further speed improvements by multithreading. For instance create 4 threads that each process 100,100,100,101 rows. No thread sync required. since all writes are specific to each thread. Just join them all at the end and you're done.

I was curious and hacked some code to time 4 different conditions.

Simple matrix multiply
Matrix multiply with transpose for memory efficiency.
Matrix multiply with 4 threads
Matrix Multiply with 4 threads and transpose

The results for a vector<vector> v vector<array> are:

  vector<array> structure
Basic multiply          0.0955 secs.
Multiply with transpose 0.0738 secs.
4 Threads               0.0268 secs.
4 Threads w transpose   0.0164 secs.

  vector<vector> structure
Basic multiply          0.1263 secs.
Multiply with transpose 0.0739 secs.
4 Threads               0.0346 secs.
4 Threads w transpose   0.0167 secs.


Matrix product(Matrix const& v0, Matrix const& v1, double& time, int mode = 0)
{
    Matrix ret = create_matrix();
    Matrix v2 = create_matrix();
    Timer timer;
    if (mode == 0)  // Basic matrix multiply
    {
        for (size_t row = 0; row < N; row++)
            for (size_t col = 0; col < N; col++)
                for (size_t col_t = 0; col_t < N; col_t++)
                    ret[row][col] += v0[row][col_t] * v1[col_t][col];
    }
    else if (mode == 1) // Matrix multiply after transposing v1 for better memory access
    {
        for (size_t r = 0; r < N; r++)      // transpose first
            for (size_t c = 0; c < N; c++)
                v2[c][r] = v1[r][c];
        for (size_t row = 0; row < N; row++)
            for (size_t col = 0; col < N; col++)
                for (size_t col_t = 0; col_t < N; col_t++)
                    ret[row][col] += v0[row][col_t] * v2[col][col_t];
    }
    else if (mode==2) // Matrix multiply using threads
    {
        // lambda to process sets of rows in separate threads
        auto do_row_set = [&v0, &v1, &ret](size_t start, size_t last) {
            for (size_t row = start; row < last; row++)
                for (size_t col = 0; col < N; col++)
                    for (size_t col_t = 0; col_t < N; col_t++)
                        ret[row][col] += v0[row][col_t] * v1[col_t][col];
        };

        vector<size_t> seq;
        const size_t NN = N / THREADS;
        // make a sequence of NN+1 rows from start to end
        for (size_t thread_n = 0; thread_n < N; thread_n += NN)
            seq.push_back(thread_n);
        seq.push_back(N);
        vector<std::future<void>> results; results.reserve(NN + 1);
        for (size_t i = 0; i < seq.size() - 1; i++)
            results.emplace_back(std::async(launchType, do_row_set, seq[i], seq[i + 1]));
        for (auto& x : results)
            x.get();
    }
    else
    {
        for (size_t r = 0; r < N; r++)      // transpose first
            for (size_t c = 0; c < N; c++)
                v2[c][r] = v1[r][c];
        // lambda to process sets of rows in separate threads
        auto do_row_set = [&v0, &v2, &ret](size_t start, size_t last) {
            for (size_t row = start; row < last; row++)
                for (size_t col = 0; col < N; col++)
                    for (size_t col_t = 0; col_t < N; col_t++)
                        ret[row][col] += v0[row][col_t] * v2[col][col_t];
        };

        vector<size_t> seq;
        const size_t NN = N / THREADS;
        // make a sequence of NN+1 rows from start to end
        for (size_t thread_n = 0; thread_n < N; thread_n += NN)
            seq.push_back(thread_n);
        seq.push_back(N);
        vector<std::future<void>> results; results.reserve(NN + 1);
        for (size_t i = 0; i < seq.size() - 1; i++)
            results.emplace_back(std::async(launchType, do_row_set, seq[i], seq[i + 1]));
        for (auto& x : results)
            x.get();
    }
    time = timer;
    return ret;
}

bool operator==(Matrix const& v0, Matrix const& v1)
{
    for (size_t r = 0; r < N; r++)
        for (size_t c = 0; c < N; c++)
            if (v0[r][c] != v1[r][c])
                return false;
    return true;
}

int main()
{
    auto fill = [](Matrix& v) {
        std::mt19937_64 r(1);
        std::normal_distribution dist(1.);
        for (size_t row = 0; row < N; row++)
            for (size_t col = 0; col < N; col++)
                v[row][col] = Float(dist(r));
    };
    Matrix m1 = create_matrix(), m2 = create_matrix(), m3 = create_matrix(),
           m4 = create_matrix(), m5 = create_matrix(), m6 = create_matrix();
    fill(m1); fill(m2);

    auto process_test = [&m1, &m2](int mode, Matrix& out) {
        const int rpt_count = 50;
        double sum = 0;
        for (int i = 0; i < rpt_count; i++)
        {
            double time;
            out = product(m1, m2, time, mode);
            sum += time / rpt_count;
        }
        return sum;
    };

    std::cout << std::fixed << std::setprecision(4);
    double time{};
    time = process_test(0, m3);
    std::cout << "Basic multiply          " << time << " secs.\n";
    time = process_test(1, m4);
    std::cout << "Multiply with transpose " << time << " secs.\n";
    time = process_test(2, m5);
    std::cout << "4 Threads               " << time << " secs.\n";
    time = process_test(3, m6);
    std::cout << "4 Threads w transpose   " << time << " secs.\n";
    if (!(m3==m4 && m3==m5 && m3==m6))
        std::cout << "Error\n";
}