I'm using Intel Cilk Plus array notation to practice vector programming. However, I met strange copy behaviors of array assignments.
The problem to solve is parallel prefix. D is input, and P is output.
//Wrong result code. Tested on Intel icc 14.0.2.
void vec_prefix(int n, int D[n], int P[n]) {
P[0:n] = D[0:n]; //initial copy
int bound=1;
int * Buf1 = _mm_malloc(n*sizeof(int), 16);
int * Buf2 = _mm_malloc(n*sizeof(int), 16);
while(bound<n){
Buf1[0:n-bound] = P[0:n-bound];
Buf2[0:n-bound] = P[bound:n-bound];
//printf("WHY??\n"); //Add this fence, the result will be correct.
P[bound:n-bound] = Buf1[0:n-bound] + Buf2[0:n-bound];
bound<<=1;
}
_mm_free(Buf1); _mm_free(Buf2);
}
If I remove the comment of the line "printf", the result is correct. Otherwise, wrong. But if all copies follow the descriptions in the Intel's document, the code should be correct.
It seems if there is no such as memory fence, the first two lines' Buf1/Buf2 copy are not finished, and the add operation later uses some unstable value. Or, the compiler optimizer just use copy-propagation to remove the copy, and create "P[bound:n-bound] = P[0:n-bound] + P[bound:n-bound]". This is undefined in Intel's document.
//Correct result code
void vec_prefix(int n, int D[n], int P[n]) {
P[0:n] = D[0:n]; //initial copy
int bound=1;
int * Buf1 = _mm_malloc(n*sizeof(int), 16);
while(bound<n){
//direct copy part
Buf1[0:bound] = P[0:bound];
//add part
Buf1[bound:n-bound] = P[bound:n-bound] + P[0:n-bound];
//copy back
P[0:n] = Buf1[0:n];
bound<<=1;
}
_mm_free(Buf1);
}
Ref: Intel's document
Your code looks correct to me. No "fence" should be required. I filed an internal bug report for it with the Intel compiler group with an example of all ones as input. Thanks for the example.
If the compiler were working correctly, you could even shorten the while loop to:
while(bound<n){
Buf1[0:n-bound] = P[0:n-bound];
P[bound:n-bound] = Buf1[0:n-bound] + P[bound:n-bound];
bound<<=1;
}
The second array section assignment is okay since the overlap for P is exactly. Alas icc exhibits its bug for this example too.