
OpenMP generates large overhead in kernel32.dll (SleepEx)


I'm working on an image-processing project using OpenMP. I have some simple code, shown below. The program runs smoothly on my Linux platform with gcc 4.3.3, but it runs incredibly slowly on Windows XP (Visual Studio 2005 with the Intel compiler v11). After some analysis, the bottleneck turned out to be SleepEx in kernel32.dll.

Is the OpenMP implementation in VC 2005 older than the one in gcc 4.3.3?

unsigned char   **a_data,
                **b_data,
                **c_data,
                *p,
                *p_a,
                *p_b,
                *p_c;
unsigned long   nr,
                nc;
nr = nc = 64;

/* Each image is one contiguous nr*nc block plus a table of row pointers. */
a_data = (unsigned char **) malloc(nr*sizeof(unsigned char *));
p = (unsigned char *) malloc(nr*nc*sizeof(unsigned char));
for(int i=0; i<nr; i++)
{
    a_data[i] = p + i*nc;   /* row stride is the row length, nc */
}
b_data = (unsigned char **) malloc(nr*sizeof(unsigned char *));
p = (unsigned char *) malloc(nr*nc*sizeof(unsigned char));
for(int i=0; i<nr; i++)
{
    b_data[i] = p + i*nc;
}
c_data = (unsigned char **) malloc(nr*sizeof(unsigned char *));
p = (unsigned char *) malloc(nr*nc*sizeof(unsigned char));
for(int i=0; i<nr; i++)
{
    c_data[i] = p + i*nc;
}

for(int i=0; i<nr; i++)
{
    p_a = a_data[i];
    p_b = b_data[i];
    p_c = c_data[i];
#pragma omp parallel for
    for(int j=0; j<nc; j++)
    {
        p_a[j] = p_b[j] + p_c[j];
    }
}

Solution

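  • If I understand correctly, SleepEx is used to suspend a thread pending some condition, which suggests that time spent in SleepEx is time a thread is not doing anything useful. This in turn suggests poor load-balancing, contention for access to shared variables, or threads idling at the start and end of parallel regions. Your code enters a parallel region around the inner loop once per row, and each region does only nc = 64 one-byte additions, so the cost of waking and synchronising the threads may well outweigh the work.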

    Before jumping to the conclusion that there is something 'wrong' with XP (which may be correct, but you haven't convinced me) you should:

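    a) Experiment with parallelising the outer loop (for(int i=0; i<nr; i++)) rather than the inner one. Play around with loop scheduling. Try collapsing the loops into one (the collapse clause needs OpenMP 3.0; with an older implementation you can fuse the loops by hand).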

    b) Be explicit about which variables are shared and which are private. I write Fortran and can't recall what the defaults for C are, only that they are subtly different. Your temporary variables p_a, p_b, p_c may be shared but it doesn't look as if they need to be.

    c) Figure out the memory access patterns of your program and make sure that they make good use of the cache.
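Suggestions (a) and (b) might look roughly like the sketch below. It is not the asker's code: alloc_image, free_image, add_rows and add_flat are hypothetical names introduced here for illustration, and the layout (one contiguous block per image plus a row-pointer table) is assumed to match the question. add_rows moves the pragma to the outer loop, so a single parallel region is entered, and declares the row pointers inside the loop body so each thread has its own. add_flat fuses the two loops by hand, which should also work with implementations that lack the collapse clause. The parallel loops use a signed int index because OpenMP 2.x requires a signed loop variable.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate an nr x nc image as one contiguous block
   plus a table of row pointers. Returns NULL on allocation failure. */
static unsigned char **alloc_image(size_t nr, size_t nc)
{
    unsigned char **rows = malloc(nr * sizeof *rows);
    unsigned char *block = malloc(nr * nc);
    if (rows == NULL || block == NULL) {
        free(rows);
        free(block);
        return NULL;
    }
    for (size_t i = 0; i < nr; i++)
        rows[i] = block + i * nc;   /* row stride is the row length, nc */
    return rows;
}

/* Hypothetical helper: release an image created by alloc_image. */
static void free_image(unsigned char **rows)
{
    if (rows != NULL) {
        free(rows[0]);   /* the contiguous pixel block */
        free(rows);      /* the row-pointer table */
    }
}

/* Suggestions (a) and (b): parallelise the outer loop and state the
   data sharing explicitly. Each thread processes whole rows. */
static void add_rows(unsigned char **a, unsigned char **b,
                     unsigned char **c, int nr, int nc)
{
    assert(a != NULL && b != NULL && c != NULL);
    int i;
#pragma omp parallel for default(none) shared(a, b, c, nr, nc) schedule(static)
    for (i = 0; i < nr; i++) {
        /* Declared inside the loop body, so these are private per thread. */
        unsigned char *p_a = a[i];
        const unsigned char *p_b = b[i];
        const unsigned char *p_c = c[i];
        for (int j = 0; j < nc; j++)
            p_a[j] = (unsigned char)(p_b[j] + p_c[j]);
    }
}

/* Hand-fused variant: treat the contiguous storage as one flat array
   of n = nr * nc bytes and split that single loop across threads. */
static void add_flat(unsigned char *a, const unsigned char *b,
                     const unsigned char *c, int n)
{
    assert(a != NULL && b != NULL && c != NULL);
    int k;
#pragma omp parallel for schedule(static)
    for (k = 0; k < n; k++)
        a[k] = (unsigned char)(b[k] + c[k]);
}
```

With the question's variables, the calls would be something like add_rows(a_data, b_data, c_data, nr, nc) or add_flat(a_data[0], b_data[0], c_data[0], nr*nc). For an image as small as 64 x 64, even a single parallel region may cost more than the additions themselves, so it is worth timing against the serial loop before committing to either variant.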