openmp increasing number of threads increases the execution time

I'm implementing sparse matrices multiplication(type of elements std::complex) after converting them to CSR(compressed sparse row) format and I'm using openmp for this, but what I noticed that increasing the number of threads doesn't necessarily increase the performance, sometimes is totally the opposite! why is that the case? and what can I do to solve the issue?

typedef std::vector < std::vector < std::complex < int >>> matrix;

struct CSR {
    std::vector<std::complex<int>> values; //non-zero values
    std::vector<int> row_ptr; //pointers of rows
    std::vector<int> cols_index; //indices of columns
    int rows; //number of rows
    int cols; //number of columns
    int NNZ; //number of non_zero elements
};

const matrix multiply_omp (const CSR& A,
    const CSR& B,const unsigned int num_threds=4) {
    if (A.cols != B.rows)
        throw "Error";
    CSR B_t = sparse_transpose(B);
    omp_set_num_threads(num_threds);
    matrix result(A.rows, std::vector < std::complex < int >>(B.cols, 0));
    #pragma omp parallel
    {
        int i, j, k, l;
        #pragma omp for
        for (i = 0; i < A.rows; i++) {
            for (j = 0; j < B_t.rows; j++) {
                std::complex < int > sum(0, 0);
                for (k = A.row_ptr[i]; k < A.row_ptr[i + 1]; k++)
                    for (l = B_t.row_ptr[j]; l < B_t.row_ptr[j + 1]; l++)
                        if (A.cols_index[k] == B_t.cols_index[l]) {
                            sum += A.values[k] * B_t.values[l];
                            break;
                        }
                if (sum != std::complex < int >(0, 0)) {
                    result[i][j] += sum;
                }
            }
        }
    }
    return result;
}

Solution

You can try to improve the scaling of this algorithm, but I would use a better algorithm. You are allocating a dense matrix (wrongly, but that's beside the point) for the product of two sparse matrices. That's wasteful since quite often the project of two sparse matrices will not be dense by a long shot.

Your algorithm also has the wrong time complexity. The way you search through the rows of B means that your complexity has an extra factor of something like the average number of nonzeros per row. A better algorithm would assume that the indices in each row are sorted, and then keep a pointer for how far you got into that row.

Read the literature on "Graph Blas" for references to efficient algorithms.