# How to parallelize the sparse matrix-matrix (SpMM) product with OpenMP for the COO format?

I recently coded a multithreaded version of the sparse matrix-vector (SpMV) product using a COO storage format:

```
void dcoomv(SparseMatrixCOO *A, double *x, double *y) {
    for (int i=0; i<A->m; i++) y[i] = 0.;
    #pragma omp parallel for reduction(+:y[:A->m])
    for (int i=0; i<A->nnz; i++) {
        y[A->row_idx[i]] += A->val[i] * x[A->col_idx[i]];
    }
}
```

Considering the race-condition potential of the COO SpMV, this is about as good as it gets, and it performs reasonably well. It actually beats MKL's `mkl_sparse_d_mv` with COO, which I believe is not parallelized at all.

Now, I would like to parallelize the sparse matrix-matrix (SpMM) product. I made two attempts. First, I wrote

```
void dcoomm_pll1(SparseMatrixCOO *A, double *X, double *Y, int nvec) {
    for (int i=0; i<(A->m * nvec); i++) Y[i] = 0.;
    #pragma omp parallel for reduction(+:Y[:A->m*nvec])
    for (int i=0; i<A->nnz; i++) {
        for (int vec=0; vec<nvec; vec++) {
            Y[A->row_idx[i] + vec * A->m] += A->val[i] * X[A->col_idx[i] + vec * A->n];
        }
    }
}
```

This, however, does not speed the code up; in fact, it runs slower in parallel.

Second, I tried

```
void dcoomm_pll2(SparseMatrixCOO *A, double *X, double *Y, int nvec) {
    for (int i=0; i<(A->m * nvec); i++) Y[i] = 0.;
    for (int i=0; i<A->nnz; i++) {
        #pragma omp parallel for
        for (int vec=0; vec<nvec; vec++) {
            Y[A->row_idx[i] + vec * A->m] += A->val[i] * X[A->col_idx[i] + vec * A->n];
        }
    }
}
```

which, again, leads to some slowdown.

As a reminder, the sequential implementation of `dcoomm` is given by

```
void dcoomm(SparseMatrixCOO *A, double *X, double *Y, int nvec) {
    for (int i=0; i<(A->m * nvec); i++) Y[i] = 0.;
    for (int i=0; i<A->nnz; i++) {
        for (int vec=0; vec<nvec; vec++) {
            Y[A->row_idx[i] + vec * A->m] += A->val[i] * X[A->col_idx[i] + vec * A->n];
        }
    }
}
```
```

My sequential version is about 30% faster than MKL's `mkl_sparse_d_mm` with COO. However, both of my multithreaded versions are much slower than `mkl_sparse_d_mm`.

Since neither `dcoomm_pll1` nor `dcoomm_pll2` offers satisfactory performance, I wonder how to parallelize the COO SpMM so as to improve parallel performance.

## Solution

The memory access pattern in your matrix-matrix multiplication code is not contiguous, which leads to more cache misses. Cache misses already hurt sequential code, and they hurt multithreaded code even more. Why not call a sequential version of `dcoomv` once per column and parallelize that loop instead? Since each thread writes to its own column of `Y`, you do not even need a reduction.

```
void dcoomm_pll3(SparseMatrixCOO *A, double *X, double *Y, int nvec) {
    #pragma omp parallel for
    for (int vec=0; vec<nvec; vec++) {
        dcoomv_seq(A, X + vec * A->n, Y + vec * A->m);
    }
}
```
