Re. optimization: Does x += y at the center of a loop always cause a read after write data dependency and so prevent vectorization?

My question is:

Re. optimization: Does x += y at the center of a loop always cause a read after write data dependency and so prevent vectorization?

See https://cvw.cac.cornell.edu/vector/coding_dependencies

Read after write ("flow" or "RAW") dependency This kind of dependency is not vectorizable. It occurs when the values of variables involved in a particular loop iteration (the "read") are determined in a previous loop iteration (the "write"). In other words, a variable is read (used as an operand to a mathematical operation) after its value has been modified by a previous loop iteration.

This question is very general in that it is basically asking whether using the += operator in the center of a loop precludes vectorization by causing a read after write ("flow" or "RAW") data dependency.

Eg.

for(i...){

   for(j...){

     x(i,j) += y(i,j)

   } 

}

See https://gcc.gnu.org/projects/tree-ssa/vectorization.html Example 14: Double reduction:

Solution

In your example, there is no issue with data dependencies assuming the array does not alias each other: the loop can be easily and safely vectorized since each x(i,j) value is only dependent of y(i,j) and y(i,j) is only read in the loop. If x and y are overlapping, then it is much harder to vectorize the loop since it causes an implicit dependency between the computation of x(i,j) value (because y(i,j) can alias with x(i-1,j) which is computed just before).

In general, automatic vectorization is very hard if not even possible when there is a dependency chain (since the sequential order is mainly the only possible one). The provided article mention such a case (where each operation require the previous one to be computed due to a data dependency). This has nothing to do with the specific += operator. It is all about data dependencies (or any loop carried dependency).

Notes

Regarding floating-point (FP) numbers, operations are no associative by default, so there is no way to execute something like x(0) + x(1) + x(2) + ... in a vectorized way (if x is an array of FP numbers). Indeed, the only possible correct order is ((x(0) + x(1)) + x(2)) + ... based on the IEEE-754 standard. Compiler options can be used to consider FP operations as associative so to be able to execute this kind of computation in a vectorized way. This breaks the IEEE-754 standard and some codes requiring the ordering not to be changed (eg. Kahan summation). Also please note that while the above expression can be vectorized (see parallel scan), it is generally not very efficient on most mainstream CPU yet. Still there are hardware that can benefit from parallel scan: mainstream discrete GPUs.