Tags: performance, gpu, compiler-optimization, openacc, pgcc

OpenACC program built with the C++ compiler is much slower than the C-built version


The code I'm working on is in C++ and is somewhat complicated, but the example below shows the problem. It comes from a book by Chandrasekaran and Juckeland. If it is compiled with nvc -acc (or pgcc -acc, as the authors did) and run, it takes a few seconds to finish. If I use nvc++ -acc (pgc++ -acc), it is orders of magnitude slower, even worse than the serial version. I'm curious whether anyone has noticed a similar issue or knows a possible explanation.

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <sys/time.h>
#define WIDTH 1000
#define HEIGHT 1000
#define TEMP_TOLERANCE 0.01
double Temperature[HEIGHT+2][WIDTH+2];
double Temperature_previous[HEIGHT+2][WIDTH+2];
void initialize();
void track_progress(int iter);

int main(int argc, char *argv[]) {
    int i, j;
    int iteration=1;
    double worst_dt=100;
    struct timeval start_time, stop_time, elapsed_time;
    gettimeofday(&start_time,NULL);
    initialize();

#pragma acc data copy(Temperature_previous), create(Temperature)    
{
    while ( worst_dt > TEMP_TOLERANCE ) {

#pragma acc kernels
        for(i = 1; i <= HEIGHT; i++) {
            for(j = 1; j <= WIDTH; j++) {
                Temperature[i][j] = 0.25 * (Temperature_previous[i+1][j]
                        + Temperature_previous[i-1][j]
                        + Temperature_previous[i][j+1]
                        + Temperature_previous[i][j-1]);
            }
        }
        worst_dt = 0.0;

#pragma acc kernels 
        for(i = 1; i <= HEIGHT; i++){
            for(j = 1; j <= WIDTH; j++){
                worst_dt = fmax( fabs(Temperature[i][j]-
                            Temperature_previous[i][j]),worst_dt);
                Temperature_previous[i][j] = Temperature[i][j];
            }
        }

        if((iteration % 100) == 0) {
#pragma acc update host(Temperature)
            track_progress(iteration);
        }
        iteration++;
    }
}

    gettimeofday(&stop_time,NULL);
    timersub(&stop_time, &start_time, &elapsed_time);
    printf("\nMax error at iteration %d was %f\n",
            iteration-1, worst_dt);
    printf("Total time was %f seconds.\n",
            elapsed_time.tv_sec+elapsed_time.tv_usec/1000000.0);
}
void initialize(){
    int i,j;
    for(i = 0; i <= HEIGHT+1; i++){
        for (j = 0; j <= WIDTH+1; j++){
            Temperature_previous[i][j] = 0.0;
        }
    }
    for(i = 0; i <= HEIGHT+1; i++) {
        Temperature_previous[i][0] = 0.0;
        Temperature_previous[i][WIDTH+1] = (100.0/HEIGHT)*i;
    }
    for(j = 0; j <= WIDTH+1; j++) {
        Temperature_previous[0][j] = 0.0;
        Temperature_previous[HEIGHT+1][j] = (100.0/WIDTH)*j;
    }
}
void track_progress(int iteration) {
    int i;
    printf("---------- Iteration number: %d ------------\n",
            iteration);
    for(i = HEIGHT-5; i <= HEIGHT; i++) {
        printf("[%d,%d]: %5.2f ", i, i, Temperature[i][i]);
    }
    printf("\n");
}

I tested it on two Linux machines with the same result. I tried various compilers, but whenever the code was compiled as C++ rather than C, the problem remained. -Minfo=all doesn't show anything of qualitative significance.


Solution

  • I'm seeing the same thing. If you add -Minfo=accel to the command line, you'll see that the compiler stops parallelizing the loop at line 37 (the second loop nest, which updates worst_dt), saying that the value of worst_dt is needed later on, which creates a dependency. I don't know why the compiler analysis isn't working correctly here, but if you change line 36 (the second #pragma acc kernels) to the following, you'll get the performance back: #pragma acc kernels loop independent collapse(2) reduction(max:worst_dt).

    Update: You can also try adding -std=c++11 or -std=c++14 to your compilation and get the expected performance without modifying the code. I am not fully sure why the default standard has this issue, but the compiler team has been made aware of this exchange.