Tags: x86, fortran, vectorization, simd, avx512

AVX-512 and Branching


I'm confused as to what masking can do in theory in relation to branches. Let's say I have a Skylake-SP (ha, I wish..), and we're ignoring compiler capabilities; just consider what's possible in theory:

If a branch conditional is dependent on a static flag, and all branches set an array to a computational result, assuming the compiler does not optimize this into two separate loops anyway, can it vectorize?

do i = 1, nx
  if (my_flag .eq. 0) then
    a(i) = b(i) ** 2
  else
    a(i) = b(i) ** 3
  end if
end do

If only a subset of the branches sets the value in question, can it vectorize?

do i = 1, nx
  if (my_flag .eq. 0) then
    a(i) = b(i) ** 2
  end if
end do

If a branch conditional is in itself dependent on vector data, can it vectorize?

do i = 1, nx
  if (c(i) > 0) then
    a(i) = b(i) ** 2
  else
    a(i) = b(i) ** 3
  end if
end do

Solution

  • Note: This answer mostly discusses a very specific memory-access issue when it comes to vectorization and it applies mostly at a conceptual level to transforming a series of scalar accesses to arrays into vectorized accesses without assuming anything about what portions of the underlying arrays are mapped. In languages like Fortran, the semantics of the language itself may guarantee that arrays are contiguously mapped, or bounds-checks before entering the loop might be enough to avoid the problem mentioned below.

    This answer shouldn't be seen as a good treatment of vectorization in general and certainly not in Fortran specifically. A more comprehensive treatment of vectorization issues appears in another answer, which also specifically addresses AVX-512.


    One often overlooked issue with vectorizing conditionals is that compilers can vectorize conditional loops of the type you are interested in, via blending or other element-wise predication techniques, only if they can prove that the vectorized code accesses the same elements as the scalar element-by-element implementation does. If the instruction set doesn't offer an element-wise way to do vector loads respecting this condition, or if the compiler is unable to use it, this can effectively block vectorization.

    Said another way, compilers can generally only fully vectorize with plain vector loads if all paths through the loop body access the same elements.

    The underlying reason is that the compiled code must not access elements that aren't accessed by the semantics of the original code, even if they are later "blended away" since doing so might cause a fault! If the instruction set doesn't provide instructions to conditionally access elements in memory and suppress faults from not-selected elements, this is a significant barrier to optimization.

    In the examples you gave, this means that (1) and (3) could be vectorized "without hoisting the condition", while (2) could not, since (2) accesses a(i) and b(i) only in the if body, and not at all if the if isn't taken. Of course, a real compiler would just hoist a trivial flag check out of the loop and not execute the loop at all when the condition is false, so it's not really a good example.
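To make the blending idea concrete, here is a rough scalar model of the "if-conversion" a vectorizer applies to a loop like your case (3). This is illustrative only, not what any particular compiler emits: both sides are computed for every element, and the condition only selects which result is stored. Note that b[i] is read on every iteration either way, so the transformed code touches exactly the same memory as the original loop.

```cpp
#include <cstddef>

// Scalar model of if-conversion for case (3): compute both branches
// unconditionally, then select per element. In vector code the select
// becomes a blend (or an AVX-512 masked move), one lane per element.
void squares_or_cubes(const double* b, const double* c, double* a, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        double sq = b[i] * b[i];          // ** 2, computed unconditionally
        double cu = sq * b[i];            // ** 3, computed unconditionally
        a[i] = (c[i] > 0) ? sq : cu;      // per-element select
    }
}
```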

    Let's just look at a case that subsumes all your examples. First, we need a flag that cannot be hoisted; let's just use an array of per-element flags. So an interesting, somewhat general loop with an output array a, two input arrays b and c, and a flag array f might look something like:

    do i = 1, nx
      if (f(i) > 0) then
        a(i) = g(b(i), c(i))
      else
        a(i) = h(b(i), c(i))
      end if
    end do
    

    Depending on the flag f(i) corresponding to each element, we apply either the function g or h to the input elements b(i) and c(i). Per my condition above we can vectorize only if both g and h actually access the same elements of b and c.

    Let's move on to two real working examples of the above:

    void example1(bool* f, int* __restrict__ a, int* __restrict__ b, int* __restrict__ c, size_t n) {
        for (size_t i = 0; i < n; i++) {
            if (f[i]) {
                a[i] = b[i];
            } else {
                a[i] = c[i];
            }
        }
    }
    
    void example2(bool* f, int* __restrict__ a, int* __restrict__ b, int* __restrict__ c, size_t n) {
        for (size_t i = 0; i < n; i++) {
            if (f[i]) {
                a[i] = b[i] + c[i] ;
            } else {
                a[i] = b[i] - c[i] * 2 + 1 ;
            }
        }
    }
    

    Both have the same basic form, but which is harder to vectorize? The first is a simple direct assignment of either b[i] or c[i], depending on the flag. The second computes a more complex function of both b[i] and c[i], and the computations differ significantly on the two paths.

    Well, the second is much easier to vectorize, since it accesses b[i] and c[i] unconditionally. In fact, gcc doesn't manage to vectorize either one, for some reason. clang only vectorizes the second. Somewhat surprisingly, icc manages to vectorize both, since it is smart enough to use vpmaskmovd, a masked load that suppresses faults for unselected elements.

    You can examine the generated assembly on godbolt.
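To get a feel for what the masked load buys you, here is a hedged scalar model of vpmaskmovd-style semantics (the helper name is made up; these are not real intrinsics): lanes whose mask is clear are never dereferenced, so they can't fault, and read back as zero.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of one lane of a fault-suppressing masked load: unselected
// lanes never touch memory and produce zero.
static std::int32_t maskload_lane(const std::int32_t* p, bool lane_on) {
    return lane_on ? *p : 0;   // unselected lane: memory not dereferenced
}

// example1 re-expressed the way a masked-load vectorization sees it: both
// loads become unconditional instructions, but each still touches only the
// elements the scalar code would have touched.
void example1_masked(const bool* f, std::int32_t* a,
                     const std::int32_t* b, const std::int32_t* c, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        std::int32_t vb = maskload_lane(&b[i], f[i]);
        std::int32_t vc = maskload_lane(&c[i], !f[i]);
        a[i] = f[i] ? vb : vc;   // blend of the two masked loads
    }
}
```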

    I had originally started this answer with the idea that accessing different array elements is currently an insurmountable barrier to vectorization for current compilers, but that's because I usually don't check icc. It's actually news to me that icc uses masked moves in this way. So the barrier is there, but at least some compilers can vault over it2.

    As the developer, you usually know that both arrays are fully accessible, so that it is safe to access all elements of b and c in the range [0, n), and it would be nice to communicate that to the compiler. I've tried adding unconditional dummy statements like b[i] = b[i]; c[i] = c[i]; or ... + c[i] * 0, which should compile to nothing but at least let the compiler see that semantically all elements are accessed. They do indeed "compile away", but the code generation is not improved: additional vectorization doesn't occur. Probably they are eliminated early in the compilation process, before the vectorization analysis is done, so that information is lost to the vectorizer.
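For reference, the dummy-statement variant described above looks like this (a sketch; as noted, it doesn't actually unlock extra vectorization, since the dummies are eliminated before the vectorizer runs):

```cpp
#include <cstddef>

// example1 with dummy "hint" statements intended to show the compiler that
// every element of b and c is accessed on every iteration. Semantically
// identical to example1; the dummies optimize away.
void example1_hinted(bool* f, int* __restrict__ a, int* __restrict__ b,
                     int* __restrict__ c, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        b[i] = b[i];   // dummy: "b[i] is accessed regardless of the flag"
        c[i] = c[i];   // dummy: "c[i] is accessed regardless of the flag"
        if (f[i]) {
            a[i] = b[i];
        } else {
            a[i] = c[i];
        }
    }
}
```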

    Other than the masked-move instructions, which aren't free and aren't fully general, are there any other ways this situation could be improved? Well, a compiler could take advantage of its knowledge of the platform's memory protection model. For example, once any byte in a 4K page on x86 has been accessed, it is free to read all other bytes on that page. One could imagine a complicated implementation that starts out in safe scalar code, but as soon as an access to both arrays is "noticed", switches over to a vectorized loop for the remainder of the page.
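A very rough sketch of that idea for example1, written as scalar code (illustrative only: it assumes 4 KiB pages and, for simplicity, page-aligned arrays so the element index doubles as the offset within the current page):

```cpp
#include <cstddef>

constexpr std::size_t ELEMS_PER_PAGE = 4096 / sizeof(int);  // assumed 4 KiB pages

// Run the careful scalar loop until both b and c have been touched inside
// the current page; after that, the rest of the page is known readable, so
// finish it with the unconditional (vectorizable) blend form.
void example1_paged(const bool* f, int* a, const int* b, const int* c, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        std::size_t page_end = (i / ELEMS_PER_PAGE + 1) * ELEMS_PER_PAGE;
        if (page_end > n) page_end = n;
        bool b_seen = false, c_seen = false;
        // Careful part: exactly the scalar semantics.
        while (i < page_end && !(b_seen && c_seen)) {
            if (f[i]) { a[i] = b[i]; b_seen = true; }
            else      { a[i] = c[i]; c_seen = true; }
            i++;
        }
        // Both pages proven accessible: unconditional, blendable remainder.
        for (; i < page_end; i++)
            a[i] = f[i] ? b[i] : c[i];
    }
}
```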

    Similar tricks could be played if the array accesses were aligned: the vectorized loop could check whether the flag vector was uniformly 0 or uniformly 1; if not, it is safe to use the straightforward unconditional, unmasked read implementation, otherwise it would fall back to the more careful implementation. Such a transformation would evidently only be profitable if the masks were rarely uniform, or almost always uniform3, and so is probably unlikely to be implemented in practice.


    2 At least if AVX is available: icc will still fail to vectorize the first example if you restrict it to pre-AVX instructions, since vmaskmovps/pd arrived with AVX (and vpmaskmovd/q with AVX2).

    3 Since in that case, if you've already determined the mask is uniform, you can implement the operation unconditionally by just doing the selected side of the if, without any masking/blending, based on whether it was uniform-0 or uniform-1. So you end up with three loops that internally implement the all-zeros flag case, the all-ones flag case, and the mixed flag case, with jumps between them when the next vector of flags doesn't match the current loop's case.
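Sketched in scalar code for example1 (the block size VEC is a made-up "vector width" for illustration; a real implementation would test the mask register directly and jump between three specialized loops rather than re-classifying each block):

```cpp
#include <cstddef>

constexpr std::size_t VEC = 8;   // pretend vector width, in elements

// Three-way structure from footnote 3: classify each block of VEC flags as
// uniform-1, uniform-0, or mixed, and run the cheapest body for that case.
void example1_uniform(const bool* f, int* a, const int* b, const int* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + VEC <= n; i += VEC) {
        bool any = false, all = true;
        for (std::size_t j = 0; j < VEC; j++) {
            any = any || f[i + j];
            all = all && f[i + j];
        }
        if (all) {           // uniform-1: only b is accessed, no blending
            for (std::size_t j = 0; j < VEC; j++) a[i + j] = b[i + j];
        } else if (!any) {   // uniform-0: only c is accessed, no blending
            for (std::size_t j = 0; j < VEC; j++) a[i + j] = c[i + j];
        } else {             // mixed: blended form (both arrays are accessed
                             // somewhere in this block by the scalar semantics)
            for (std::size_t j = 0; j < VEC; j++)
                a[i + j] = f[i + j] ? b[i + j] : c[i + j];
        }
    }
    for (; i < n; i++) a[i] = f[i] ? b[i] : c[i];   // scalar tail
}
```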