Reduction in OpenMP with GMP/ARB matrices

I want to parallelize a program that I have written which calculates a series involving matrix and vector products with the result being a vector. Since arguments become very small and large, I use ARB (based on GMP, MPFR and flint) to prevent Loss of significance. Also, since series elements are independent, matrix dimensions are not big, but the series needs to be evaluated upto 50k elements, it makes sense to have a number of threads compute a few elements of the series each, i.e. 5 threads could each compute 10k elements in parallel and then add up the resulting vectors.

The issue is now that the ARB function to add up vectors and matrices is not a standard operation that can be used withtin an openmp reduction easily. When I naively try to write a custom reduction, g++ complains about the void type, since operations in ARB do not have a return value:

void arb_add(arb_t z, const arb_t x, const arb_t y, slong prec)¶

will calculate and set z to z=x+y with a precision of prec bits, but arb_add itself is s a void function.

As an example: a random for-loop for a similar problem looks like this (my actual program is different of course)

[...]

arb_mat_t RMa,RMb,RMI,RMP,RMV,RRes;

arb_mat_init(RMa,Nmax,Nmax); //3 matrices
arb_mat_init(RMb,Nmax,Nmax);
arb_mat_init(RMI,Nmax,Nmax);
arb_mat_init(RMV,Nmax,1);  // 3 vectors
arb_mat_init(RMP,Nmax,1);
arb_mat_init(RRes,Nmax,1);

[...]

//Result= V + ABV +(AB)^2 V + (AB)^3 V + ...
//in my actual program A and B would be j- and k-dependent and would
//include more matrices and vectors

#pragma omp parallel for collapse(1) private(j)
for(j=0; j<jmax; j++){
        arb_mat_one(RMI); //sets the matrix RMI to 1
        for(k=0; k<j; k++){ 
            Qmmd(RMI,RMI,RMa,Nmax,prec); //RMI=RMI*RMa
            Qmmd(RMI,RMI,RMb,Nmax,prec); //RMI=RMI*RMb
            cout << "j=" << j << ",   k=" << k << "\r" << flush;
        }

        Qmvd(RMP,RMI,RMV,Nmax,prec);     //RMP=RMI*RMV
        arb_mat_add(RRes,RRes,RMP,prec); //Result=Result + RMP        
}

[...]

which of course breaks down when using more than 1 thread because I did not specify a reduction on RRes. Here Qmmd() and Qmvd() are self-written matrix-matrix and matrix-vector product functions, and RMa, RMb, and RMV are random matrices and vectors, resp.

The idea is now to reduce RRes, such that each thread can compute a private version of RRes including a fraction of the final result, before adding them all up using arb_mat_add. I could write a function matrixadd(A,B) to compute A=A+B

void matrixadd(arb_mat_t A, arb_mat_t B) {
    arb_mat_add(A,A,B,2000); 
//A=A+B, the last value is the precision in bits used for that operation
}

and then eventually

#pragma omp declare reduction \
(myadd : void : matrixadd(omp_out, omp_in))
#pragma omp parallel for private(j) reduction(myadd:RRes) 
for(j=0; j<jmax; j++){
        arb_mat_one(RMI);
        for(k=0; k<j; k++){
            Qmmd(RMI,RMI,RMa,Nmax,prec);
            Qmmd(RMI,RMI,RMb,Nmax,prec);
            cout << "j=" << j << ",   k=" << k << "\r" << flush;
        }

        Qmvd(RMP,RMI,RMV,Nmax,prec); 
        matrixadd(RRes,RMP);        
}

Gcc is not happy with this:

main.cpp: In function ‘int main()’:
main.cpp:503:46: error: invalid use of void expression
     (myadd : void : matrixadd(omp_out, omp_in))
                                          ^
main.cpp:504:114: error: ‘RRes’ has invalid type for ‘reduction’

Can Openmp understand my void reduction and can be made to work with ARB and GMP? If so, how? Thanks!

(Also, my program currently includes a convergence check with a break condition in the j-for-loop. If you also know how to easily implement such a thing, too, I would be very grateful because for my current openmp tests I just removed the break and set a constant jmax.)

My question is very similar to this one.

Edit: Sorry, here is my try of a minimal, complete and verifiable example. Additional required packages are arb, flint, gmp, mpfr (available through packetmanagers) and gmpfrxx.

#include <iostream>
#include <omp.h>
#include <cmath>
#include <ctime>

#include <cmath>
#include <gmp.h>
#include "gmpfrxx/gmpfrxx.h"

#include "arb.h"
#include "acb.h"
#include "arb_mat.h"

using namespace std;

void generate_matrixARBdeterministic(arb_mat_t Mat, int N, double w2) //generates some matrix
{
    int i,j;
    double what;
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
        {
            what=(i*j+30/w2)/((1+0.1*w2)*j+20-w2);
            arb_set_d(arb_mat_entry(Mat,i,j),what);
        }
    }
}

void generate_vecARBdeterministic(arb_mat_t Mat, int N) //generates some vector
{
    int i;
    double what;
    for(i=0;i<N;i++)
    {
        what=(4*i*i+40)/200;
        arb_set_d(arb_mat_entry(Mat,i,0),what);
    }
}

void Qmmd(arb_mat_t res, arb_mat_t MA, arb_mat_t MB, int NM, slong prec)
{   ///res=M*M=Matrix * Matrix

    arb_t Qh1;
    arb_mat_t QMh;

    arb_init(Qh1);

    arb_mat_init(QMh,NM,NM);

    for (int i=0; i<NM; i++){
            for(int j=0; j<NM; j++){
                     for (int k=0; k<NM; k++ ) {
                        arb_mul(Qh1,arb_mat_entry(MA, i, k),arb_mat_entry(MB, k, j),prec);
                        arb_add(arb_mat_entry(QMh, i, j),arb_mat_entry(QMh, i, j),Qh1,prec);
                    }
             }
    }

    arb_mat_set(res,QMh);

    arb_mat_clear(QMh);
    arb_clear(Qh1);
  }

void Qmvd(arb_mat_t res, arb_mat_t M, arb_mat_t V, int NM, slong prec)  //res=M*V=Matrix * Vector
{ ///res=M*V
    arb_t Qh,Qs;
    arb_mat_t QMh;

    arb_init(Qh);
    arb_init(Qs);
    arb_mat_init(QMh,NM,1);

    arb_set_ui(Qh,0.0);
    arb_set_ui(Qs,0.0);
    arb_mat_zero(QMh);

    for (int i=0; i<NM; i++){
            arb_set_ui(Qs,0.0);
            for(int j=0; j<NM; j++){
                    arb_mul(Qh,arb_mat_entry(M, i, j),arb_mat_entry(V, j, 0),prec);
                    arb_add(Qs,Qs,Qh,prec);
             }
            arb_set(arb_mat_entry(QMh, i, 0),Qs);
    }
    arb_mat_set(res,QMh);

    arb_mat_clear(QMh);
    arb_clear(Qh);
    arb_clear(Qs);
}

void QPrintV(arb_mat_t A, int N){ //Prints Vector
    for(int i=0;i<N;i++){
                cout <<  arb_get_str(arb_mat_entry(A, i, 0),5,0) << endl; //ARB_STR_NO_RADIUS
    }
}

void matrixadd(arb_mat_t A, arb_mat_t B) {
    arb_mat_add(A,A,B,2000);
}

int main() {
    int Nmax=10,jmax=300; //matrix dimension and max of j-loop
    ulong prec=2000; //precision for arb

    //initializations

    arb_mat_t RMa,RMb,RMI,RMP,RMV,RRes;

    arb_mat_init(RMa,Nmax,Nmax);
    arb_mat_init(RMb,Nmax,Nmax);
    arb_mat_init(RMI,Nmax,Nmax);
    arb_mat_init(RMV,Nmax,1);
    arb_mat_init(RMP,Nmax,1);
    arb_mat_init(RRes,Nmax,1);

    omp_set_num_threads(1);
    cout << "Maximal threads is " << omp_get_max_threads() << endl;

    generate_matrixARBdeterministic(RMa,Nmax,1.0); //generates some Matrix for RMa
    arb_mat_set(RMb,RMa); // sets RMb=RMa

    generate_vecARBdeterministic(RMV,Nmax); //generates some vector

    double st=omp_get_wtime();

    Qmmd(RMI,RMa,RMb,Nmax,prec);
    int j,k=0;

    #pragma omp declare reduction \
    (myadd : void : matrixadd(omp_out, omp_in))
    #pragma omp parallel for private(j) reduction(myadd:RRes)
    for(j=0; j<jmax; j++){
            arb_mat_one(RMI);
            for(k=0; k<j; k++){
                Qmmd(RMI,RMI,RMa,Nmax,prec);
                Qmmd(RMI,RMI,RMb,Nmax,prec);
                cout << "j=" << j << ",   k=" << k << "\r" << flush;
            }

            Qmvd(RMP,RMI,RMV,Nmax,prec);  
            matrixadd(RRes,RMP);      
    }

    QPrintV(RRes,Nmax);

    double en=omp_get_wtime();
    printf("\n Time it took was %lfs\n",en-st);


    arb_mat_clear(RMa);
    arb_mat_clear(RMb);
    arb_mat_clear(RMV);
    arb_mat_clear(RMP);
    arb_mat_clear(RMI);
    arb_mat_clear(RRes);

    return 0;
}

and

g++ test.cpp -g -fexceptions -O3 -ltbb -fopenmp -lmpfr -lflint -lgmp -lgmpxx -larb -I../../PersonalLib -std=c++14 -lm -o b.out

Solution

You can do the reduction by hand like this

#pragma omp parallel
{
    arb_mat_t RMI, RMP;
    arb_mat_init(RMI,Nmax,Nmax);  //allocate memory
    arb_mat_init(RMP,Nmax,1);     //allocate memory
    #pragma omp for
    for(int j=0; j<jmax; j++){
        arb_mat_one(RMI);
        for(int k=0; k<j; k++){
            Qmmd(RMI,RMI,RMa,Nmax,prec);
            Qmmd(RMI,RMI,RMb,Nmax,prec);
        }
    }
    Qmvd(RMP,RMI,RMV,Nmax,prec);
    #pragma omp critical
    arb_mat_add(RRes,RRes,RMP,prec);
    arb_mat_clear(RMI);  //deallocate memory
    arb_mat_clear(RMP);  //deallocate memory
}

If you want to use declare reduction you need to make a C++ wrapper for arb_mat_t. Using declare reduction lets OpenMP decide how to do the reduction. But I highly doubt you will find a case where this gives better performance than the manual case.