Tags: c, vector, openmp, matrix-multiplication

OpenMP parallel multiplication slower than Sequential multiplication


I'm learning OpenMP and I'm trying to do a simple task: A[r][c] * X[c] = B[r] (matrix-vector multiplication). The problem is that the sequential code is faster than the parallel version, and I don't know why! My code:

#include <omp.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <sys/types.h>


// Defined variables
#define row_matriz_A 80000
#define col_matriz_A 800
#define THREADS_NUM 4

// FUNCTION - GENERATE MATRICES
void gerarMatrizes(int r, int c, int mA[], int vX[], int vB[]){...}

// FUNCTION - SEQUENTIAL MULTIPLICATION
void multSequencial(int r, int c, int mA[], int vX[], int vB[]){
    // Variables
    int i, j, offset, sum;                        
    struct timeval tv1,tv2;  
    double t1, t2;        
    // Begin Time
    gettimeofday(&tv1, NULL);
    t1 = (double)(tv1.tv_sec) + (double)(tv1.tv_usec)/ 1000000.00;
    for(i = 0; i < r; i++){
        sum = 0;
        for(j = 0; j < c; j++){
            offset = i * c + j;
            sum += mA[offset] * vX[j];
        }
        vB[i] = sum;
    }
    // End time
    gettimeofday(&tv2, NULL);
    t2 = (double)(tv2.tv_sec) + (double)(tv2.tv_usec)/ 1000000.00;
    printf("\nO tempo de execucao sequencial foi: %lf segundos.\n", (t2 - t1));
    return;
}

// FUNCTION - PARALLEL MULTIPLICATION WITH OpenMP
void matvecHost(int r, int c, int mA[], int vX[], int vB[]){
    // Variables
    int tID, i, j, offset, sum;
    struct timeval tv1, tv2;
    double t1, t2;
    // Init vB
    for(i = 0; i < r; i++) vB[i] = 0;
    // BEGIN Time
    gettimeofday(&tv1, NULL);
    t1 = (double)(tv1.tv_sec) + (double)(tv1.tv_usec)/ 1000000.00;
    omp_set_num_threads(THREADS_NUM);
    #pragma omp parallel private(tID, i, j) shared(mA, vB, vX)
    {
        tID = omp_get_thread_num();     
        #pragma omp for
            for(i = 0; i < r; i++){
                sum = 0;
                for(j = 0; j < c; j++){
                    offset = i * c + j;
                    sum += mA[offset] * vX[j];
                }
                vB[i] = sum;
            }
    }
    // End time
    gettimeofday(&tv2, NULL);
    t2 = (double)(tv2.tv_sec) + (double)(tv2.tv_usec)/ 1000000.00;
    printf("\nO tempo de execucao OpenMP foi: %lf segundos.\n", (t2 - t1));
    return;
}

// FUNCTION - MAIN
int main(int argc, char * argv[]) {
    int row, col;
    row = row_matriz_A;
    col = col_matriz_A;
    int *matrizA = (int *)calloc(row * col, sizeof(int));
    int *vectorX = (int *)calloc(col * 1, sizeof(int));
    int *vectorB = (int *)calloc(row * 1, sizeof(int));
    gerarMatrizes(row, col, matrizA, vectorX, vectorB);                    
    multSequencial(row, col, matrizA, vectorX, vectorB);
    matvecHost(row, col, matrizA, vectorX, vectorB);
    return 0;
}

Previous solutions that did not work:

  • Using collapse on my nested for loops
  • Increasing the row and column sizes
  • Increasing the number of threads (a teacher recommended using a thread count equal to the number of physical cores; see the sketch after this list)
  • Using malloc instead of m[i][j]
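
For the third item, here is a minimal sketch, not from the original post (the helper name setThreadsToHardware is made up), of how the thread count could be taken from the hardware instead of the hard-coded THREADS_NUM. Note that omp_get_num_procs() counts logical processors, which can be more than the number of physical cores when hyper-threading is enabled.

#include <omp.h>

// Hypothetical helper: use one OpenMP thread per logical processor reported
// by the runtime instead of a hard-coded constant.
static void setThreadsToHardware(void){
    int nprocs = omp_get_num_procs();   // logical processors available
    omp_set_num_threads(nprocs);
}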

EDIT - ANSWER

I changed my parallel block based on the accepted answer:

#pragma omp parallel private(i, j, sum) shared(mA, vB, vX)
{
    #pragma omp for
        for(i = 0; i < r; i++){
            sum = 0;
            for(j = 0; j < c; j++){
                sum += mA[i * c + j] * vX[j];
            }
            vB[i] = sum;
        }
}

I still have one doubt:

  • If I declare i, j and sum inside my parallel block, will they be made private automatically? Does this improve the speed of my code or not?

Solution

  • You have race conditions on sum and offset - those are shared between the threads instead of being thread-private.

    This also likely explains the slowdown: On x86, the CPU will actually work hard to make sure accesses to shared variables "work". This involves flushing cache lines after every (!) write to offset and sum - so all the threads are wildly writing into the same variables, but each one has to wait until the write from the previous thread (on a different core) has arrived in the local cache again after having been flushed. And of course it will produce completely nonsensical results.

    I don't know why you are declaring all your variables at the start of the function - that's prone to this kind of mistake. If you had declared i, j, sum and offset (and the unused tID) in the smallest possible scopes instead, you would never have had this problem, because they would have been thread-private automatically.
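
For illustration, a minimal sketch of what that last paragraph suggests (the function name matvecScoped is made up here): with the loop counters and the accumulator declared in the smallest possible scope, they are private to each thread automatically and no private(...) clause is needed.

// Sketch only: same multiplication as above, but i, j and sum are declared
// where they are used, so each thread automatically gets its own copies.
void matvecScoped(int r, int c, int mA[], int vX[], int vB[]){
    #pragma omp parallel for
    for(int i = 0; i < r; i++){          // i: declared in the loop header, private
        int sum = 0;                     // sum: local to the iteration, private
        for(int j = 0; j < c; j++){      // j: private as well
            sum += mA[i * c + j] * vX[j];
        }
        vB[i] = sum;
    }
}

Declaring them inside the block does not speed the loop up by itself; it simply removes the accidental sharing (and the cache-line traffic described above) that made the parallel version slower than the sequential one.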