Tags: c++, openmp, eigen, eigen3

# Reduction with OpenMP: Matrix/tensor operations examples

I am trying to learn how to use reduction with OpenMP and was wondering if the following examples can be rewritten with reduction?

Example A: Finding the sum of three matrices/tensors

```cpp
#include <unsupported/Eigen/CXX11/Tensor>
#include <complex>

static const int nx = 128;
static const int ny = 128;
static const int nz = 128;

Eigen::Tensor<std::complex<double>, 3> Tensor1(nx,ny,nz);
Eigen::Tensor<std::complex<double>, 3> Tensor2(nx,ny,nz);
Eigen::Tensor<std::complex<double>, 3> Tensor3(nx,ny,nz);
Eigen::Tensor<std::complex<double>, 3> Sum(nx,ny,nz);

for(int i = 0; i < nx; i++){
    for(int j = 0; j < ny; j++){
        for(int k = 0; k < nz; k++){
            Sum(k,i,j) = Tensor1(k,i,j) + Tensor2(k,i,j) + Tensor3(k,i,j);
        }
    }
}
```

Example B: Finding separate terms of the cross product of (AxB):

```cpp
#include <unsupported/Eigen/CXX11/Tensor>
#include <complex>

static const int nx = 128;
static const int ny = 128;
static const int nz = 128;

double Bx, By, Bz;          // components of B, set elsewhere
double B[3] = {Bx,By,Bz};

Eigen::Tensor<std::complex<double>, 3> Ax(nx,ny,nz);
Eigen::Tensor<std::complex<double>, 3> Ay(nx,ny,nz);
Eigen::Tensor<std::complex<double>, 3> Az(nx,ny,nz);

Eigen::Tensor<std::complex<double>, 3> Termx(nx,ny,nz);
Eigen::Tensor<std::complex<double>, 3> Termy(nx,ny,nz);
Eigen::Tensor<std::complex<double>, 3> Termz(nx,ny,nz);

for(int i = 0; i < nx; i++){
    for(int j = 0; j < ny; j++){
        for(int k = 0; k < nz; k++){
            Termx(k,i,j) = Az(k,i,j) * By - Ay(k,i,j) * Bz;
            Termy(k,i,j) = Ax(k,i,j) * Bz - Az(k,i,j) * Bx;
            Termz(k,i,j) = Ay(k,i,j) * Bx - Ax(k,i,j) * By;
        }
    }
}
```

I have not seen examples of reduction with either nested loops or matrices/tensors so I am not sure if the above examples work with reduction or not. Thanks!

## Solution

The short answer is no.

The longer answer is that your examples don't fit with the basic idea of what reduction is intended to accomplish.

Reduction is typically used for a case where all the threads write to a single output. For a really simplistic example, consider normalizing a large matrix, by simply adding up all the numbers, then dividing each by the sum so they're all in the range 0..1.

```cpp
#pragma omp parallel for
for (int i=0; i<n; i++)
    for (int j=0; j<m; j++)
        for (int k=0; k<o; k++)
            sum += matrix[i][j][k];
```

This has a problem: all the threads write to `sum`, which is a data race -- OpenMP does not serialize that access for you, so the result is undefined. To get meaningful results, we would have to serialize the updates ourselves (for example with `#pragma omp critical` or `#pragma omp atomic`), and then we gain very little (if anything) from all the threads running in parallel. Depending on the implementation, we might easily be locking and unlocking a mutex every iteration, in which case the parallel code might even be quite a lot slower than serial code.
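For illustration, the serialized version can be sketched like this (a minimal sketch using a flattened `std::vector<double>` in place of the 3D matrix; `sum_with_critical` is a made-up name):

```cpp
#include <vector>

// Sum all elements of a flattened array, serializing the shared update
// with a critical section. This is correct, but every iteration takes
// the lock, so it is often no faster than serial code.
double sum_with_critical(const std::vector<double>& data) {
    double sum = 0.0;
    #pragma omp parallel for
    for (long i = 0; i < (long)data.size(); i++) {
        #pragma omp critical
        sum += data[i];
    }
    return sum;
}
```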

We can use `reduction` to fix that:

```cpp
#pragma omp parallel for reduction(+:sum)
for (int i=0; i<n; i++)
    for (int j=0; j<m; j++)
        for (int k=0; k<o; k++)
            sum += matrix[i][j][k];
```

With this, each thread gets a private variable where it stores its local sum. When the threads finish, the runtime performs the reduction (using the operator we specified, `+` in this case) and combines all those separate values into `sum`.
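What `reduction` does can also be written out by hand, which makes the mechanism explicit (again a minimal sketch over a flat `std::vector<double>`; `sum_manual_reduction` is a made-up name):

```cpp
#include <vector>

// Hand-rolled equivalent of reduction(+:sum): each thread accumulates
// into a private local_sum, and the partial results are combined once
// per thread at the end instead of once per iteration.
double sum_manual_reduction(const std::vector<double>& data) {
    double sum = 0.0;
    #pragma omp parallel
    {
        double local_sum = 0.0;   // private per-thread accumulator
        #pragma omp for nowait
        for (long i = 0; i < (long)data.size(); i++)
            local_sum += data[i];
        #pragma omp critical      // one combine per thread, not per element
        sum += local_sum;
    }
    return sum;
}
```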

`reduction` helps when all the threads work together to produce a single value, and we can avoid pressure on that one value by having each thread produce a value independently, and afterwards combine all those independent values.

That doesn't apply in your case: each of your writes already goes to a distinct element that no other thread touches, so there is no shared output to reduce, and a plain `parallel for` is all you need.
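For loops like the ones in the question, that plain `parallel for` can be sketched as follows (flattened `std::vector`s stand in for the `Eigen::Tensor` objects; `tensor_sum` is a made-up name):

```cpp
#include <complex>
#include <vector>

// Element-wise sum of three "tensors" (flattened 3D arrays). Each
// iteration writes a distinct element of out, so a plain parallel for
// is sufficient -- no reduction clause is needed.
void tensor_sum(const std::vector<std::complex<double>>& t1,
                const std::vector<std::complex<double>>& t2,
                const std::vector<std::complex<double>>& t3,
                std::vector<std::complex<double>>& out) {
    #pragma omp parallel for
    for (long i = 0; i < (long)out.size(); i++)
        out[i] = t1[i] + t2[i] + t3[i];
}
```

With the actual `Eigen::Tensor` code, the same pragma can go directly on the outer `i` loop (optionally with `collapse(2)` to parallelize over `i` and `j` together), since no element is ever written by more than one thread.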