I'm playing a bit with auto-parallelization in ICC (11.1; old, but I can't do anything about it), and I'm wondering why the compiler can't parallelize the inner loop of a simple Gaussian elimination:
void makeTriangular(float **matrix, float *vector, int n) {
    for (int pivot = 0; pivot < n - 1; pivot++) {
        // swap rows so that the row with the largest value is
        // at the pivot position for numerical stability
        int swapPos = findPivot(matrix, pivot, n);
        std::swap(matrix[pivot], matrix[swapPos]);
        std::swap(vector[pivot], vector[swapPos]);

        float pivotVal = matrix[pivot][pivot];
        for (int row = pivot + 1; row < n; row++) {           // line 72; should be parallelized
            float tmp = matrix[row][pivot] / pivotVal;        // line 73
            for (int col = pivot + 1; col < n; col++) {       // line 74
                matrix[row][col] -= matrix[pivot][col] * tmp; // line 75
            }
            vector[row] -= vector[pivot] * tmp;
        }
    }
}
We only write to the arrays at positions that depend on the (private) row and col variables, and row is guaranteed to be larger than pivot, so it should be obvious to the compiler that we never overwrite the pivot row we are reading from.
I'm compiling with -O3 -fno-alias -parallel -par-report3
and get lots of dependencies like: assumed FLOW dependence between matrix line 75 and matrix line 73, or assumed ANTI dependence between matrix line 73 and matrix line 75, and the same for line 75 alone. What problem does the compiler have? Obviously I could tell it exactly what to do with some pragmas, but I want to understand what the compiler can work out on its own.
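For reference, the pragma route I'd rather avoid would look something like this (just a sketch; as far as I know, ICC's #pragma parallel tells the auto-parallelizer to disregard assumed dependencies in the loop that follows):

float pivotVal = matrix[pivot][pivot];
#pragma parallel // assert to ICC: no loop-carried dependencies in this loop
for (int row = pivot + 1; row < n; row++) {
    float tmp = matrix[row][pivot] / pivotVal;
    for (int col = pivot + 1; col < n; col++) {
        matrix[row][col] -= matrix[pivot][col] * tmp;
    }
    vector[row] -= vector[pivot] * tmp;
}

But the interesting part is why the dependence analysis fails without such a hint.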
Basically, the compiler can't figure out that there's no dependency because the names matrix and vector are both read from and written to (even though in different regions). You might be able to get around this in the following (slightly dirty) fashion:
void makeTriangular(float **matrix, float *vector, int n)
{
    for (int pivot = 0; pivot < n - 1; pivot++)
    {
        // swap rows so that the row with the largest value is
        // at the pivot position for numerical stability
        int swapPos = findPivot(matrix, pivot, n);
        std::swap(matrix[pivot], matrix[swapPos]);
        std::swap(vector[pivot], vector[swapPos]);

        float pivotVal = matrix[pivot][pivot];

        float **matrixForWriting = matrix; // COPY THE POINTER
        float *vectorForWriting = vector;  // COPY THE POINTER

        // (then parallelize this next for loop as you were)
        for (int row = pivot + 1; row < n; row++) {
            float tmp = matrix[row][pivot] / pivotVal;
            for (int col = pivot + 1; col < n; col++) {
                // WRITE TO THE matrixForWriting VERSION
                matrixForWriting[row][col] = matrix[row][col] - matrix[pivot][col] * tmp;
            }
            // WRITE TO THE vectorForWriting VERSION
            vectorForWriting[row] = vector[row] - vector[pivot] * tmp;
        }
    }
}
The bottom line is to give the arrays you're writing to a temporarily different name to trick the compiler. I know it's a little dirty, and I wouldn't recommend this kind of programming in general. But if you're sure you have no data dependency, it's perfectly fine.
In fact, I'd put some very clear comments around it so that future readers of this code know it was a workaround, and why you did it.
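An alternative that states the same guarantee more directly, if your compiler setup allows it, is restrict-qualified row pointers (a sketch; I'm assuming ICC accepts the GCC-style __restrict__ spelling, or the restrict keyword together with the -restrict switch):

const float *__restrict__ pivotRow = matrix[pivot]; // read-only pivot row
for (int row = pivot + 1; row < n; row++) {
    float *__restrict__ curRow = matrix[row];       // the row being updated
    float tmp = curRow[pivot] / pivotVal;
    for (int col = pivot + 1; col < n; col++) {
        curRow[col] -= pivotRow[col] * tmp;         // restrict promises no overlap
    }
    vector[row] -= vector[pivot] * tmp;
}

This promises the compiler that pivotRow and curRow never refer to overlapping memory, which is exactly the fact its dependence analysis couldn't prove on its own.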
Edit: I think the answer was basically touched on by @FPK, and an answer was posted by @Evgeny Kluev. However, @Evgeny Kluev's answer suggests making the write target a separate input parameter; that might parallelize, but it won't give the correct result, since the entries of matrix would never be updated in place and later pivot iterations would read stale values. I think the code I posted above gives the correct answer too, because matrixForWriting still points at the same storage.