OpenMP tasks execution time varies depending on where I put the pragmas

I'm trying to parallelize some code using OpenMP (and MPI) using tasks. I have the following code:

double t_copy = 0, t_forward = 0, t_backward = 0, t_diag = 0;
void pc_ssor_poisson3d(int N, void *data,
                       double *restrict Ax,
                       double *restrict x)
{
// clocks for timing
#define COPY_CLOCK 20
#define FW_SSOR_CLOCK 21
#define DIAG_SSOR_CLOCK 22
#define BW_SSOR_CLOCK 23

    pc_ssor_p3d_t *ssor_data = (pc_ssor_p3d_t *)data;
    int n = ssor_data->n;
    double w = ssor_data->omega;
    tic(COPY_CLOCK);
#ifdef PAR_PC
    parallel_copy(N, Ax, x);
#else
    memcpy(Ax, x, N * sizeof(double));
#endif
    t_copy += toc(COPY_CLOCK);
    tic(FW_SSOR_CLOCK);
#ifdef PAR_PC

        parallel_ssor_forward_sweep(n, 0, n, 0, n, 0, n, Ax, w); // --1--

#else
...

This is the function parallel_ssor_forward_sweep:

void parallel_ssor_forward_sweep(int n, int i1, int i2, int j1, int j2, int k1, int k2, double *restrict Ax, double w)
{
    char *dep_matrix = (char *)malloc(sizeof(char) * (i2 - i1) / BS * (j2 - j1) / BS * (k2 - k1) / BS);
    // --2--
    for (int k = 0; k < (k2 - k1) / BS; k++)
    {
        for (int j = 0; j < (j2 - j1) / BS; j++)
        {
            for (int i = 0; i < (i2 - i1) / BS; i++)
            {
                // ssor_forward_sweep_pwrap(n, i1 + i * BS, i1 + (i + 1) * BS, j1 + j * BS, j1 + (j + 1) * BS, k1 + k * BS, k1 + (k + 1) * BS, Ax, w, dep_matrix, i, j, k);
                ssor_forward_sweep_pwrap(n, i1, i2, j1, j2, k1, k2, Ax, w, dep_matrix, i, j, k);
            }
        }
    }
    free(dep_matrix);
    
}

The function ssor_forward_sweep_pwrap actually creates the OpenMP task, I'll attach the code:

void ssor_forward_sweep_pwrap(int n, int i1, int i2, int j1, int j2, int k1, int k2, double *restrict Ax, double w, char *dep_matrix, int i, int j, int k)
{
#define dep_mat(i, j, k) (dep_matrix[(k * (j2 - j1) + j) * (i2 - i1) + i])

    char *top_dep = k - 1 >= 0 ? &dep_mat(i, j, k - 1) : NULL;
    char *left_dep = j - 1 >= 0 ? &dep_mat(i, j - 1, k) : NULL;
    char *back_dep = i - 1 >= 0 ? &dep_mat(i - 1, j, k) : NULL;
    char *out_dep = &dep_mat(i, j, k);

#pragma omp task depend(in                                          \
                        : *top_dep, *left_dep, *back_dep) depend(inout \
                                                              : *out_dep)

    {
        ssor_forward_sweep(n, i1 + i * BS, i1 + (i + 1) * BS, j1 + j * BS, j1 + (j + 1) * BS, k1 + k * BS, k1 + (k + 1) * BS, Ax, w);
    }
#undef dep_mat
}

Pragma directive:

#pragma omp parallel
#pragma omp single
    {
    }

Now the problem is that if I put the directive above to make the code parallel around the parallel_ssor_forward_sweep call (marked with the comment --1-- in the code) I get much better times (around 9.6/9.7 seconds for that code section) vs if I put it around the for marked in the code with --2-- comment, so getting the whole function code but the malloc (times 12.7/12.8 seconds).

I've executed the code 3 times for each to make sure it wasn't a fluke and with the same number of threads (6 in this case).

I'm running on my university machine which supposedly should not have any other program running at the same time as mine for the allocated resources.

The reason why I believe this behaviour is strange is that inside the pragma omp single region I expect one single thread to execute the code, so I don't think having the malloc inside or outside the region should lead to such a difference.

Also the code gives the same result and is run with the same input.

Solution

The problem, as mentioned by @JohnBollinger in the comments, was due to the fact that when I put the pragma as below, the thread that executed the #pragma omp single region could get to the free(dep_matrix) before the other threads had ended.

#ifdef PAR_PC
#pragma omp parallel
#pragma omp single
    {
        parallel_ssor_forward_sweep(n, 0, n, 0, n, 0, n, Ax, w);
    }
#else

By putting pragma inside the parallel_ssor_forward_sweep function like below, the thread executing the #pragma omp single region has to wait for the other threads at the end of the region (i.e. before the free(dep_matrix)), and so the matrix is still available when the other threads need it.

void parallel_ssor_forward_sweep(int n, int i1, int i2, int j1, int j2, int k1, int k2, double *restrict Ax, double w)
{
    char *dep_matrix = (char *)malloc(sizeof(char) * (i2 - i1) / BS * (j2 - j1) / BS * (k2 - k1) / BS);
#pragma omp parallel
#pragma omp single
    {
        for (int k = 0; k < (k2 - k1) / BS; k++)
        {
            for (int j = 0; j < (j2 - j1) / BS; j++)
            {
                for (int i = 0; i < (i2 - i1) / BS; i++)
                {
                    // ssor_forward_sweep_pwrap(n, i1 + i * BS, i1 + (i + 1) * BS, j1 + j * BS, j1 + (j + 1) * BS, k1 + k * BS, k1 + (k + 1) * BS, Ax, w, dep_matrix, i, j, k);
                    ssor_forward_sweep_pwrap(n, i1, i2, j1, j2, k1, k2, Ax, w, dep_matrix, i, j, k);
                }
            }
        }
    }
    free(dep_matrix);
}