Search code examples
vectorizationsimdllvm-clangmicro-optimizationavx2

Is it possible to convince clang to auto-vectorize this code without using intrinsics?


Imagine I have this naive function to detect sphere overlap. The point of this question is not really to discuss the best way to do hit testing on spheres, so this is just for illustration.

inline bool sphere_hit(float x1, float y1, float z1, float r1,
        float x2, float y2, float z2, float r2) {
    float xd = (x1 - x2);
    float yd = (y1 - y2);
    float zd = (z1 - z2);

    float max_dist = (r1 + r2);

    return xd * xd + yd * yd + zd * zd < max_dist * max_dist;
}

And I call it in a nested loop, as follows:

std::vector<float> xs, ys, zs, rs;
int n_spheres;
// <snip>
int n_hits = 0;
for (int i = 0; i < n_spheres; ++i) {
    for (int j = i + 1; j < n_spheres; ++j) {
        if (sphere_hit(xs[i], ys[i], zs[i], rs[i],
                xs[j], ys[j], zs[j], rs[j])) {
            ++n_hits;
        }
    }
}
std::printf("total hits: %d\n", n_hits);

Now, clang (with -O3 -march=native) is smart enough to figure out how to vectorize (and unroll) this loop into 256-bit avx2 instructions. Awesome!

However, if I do anything more complicated than increment the number of hits, for example calling some arbitrary function handle_hit(i, j), clang instead emits a naive scalar version.

Hits should be very rare, so what I think should happen is checking on every vectorized loop iteration if the value is true for any of the lanes, and jumping to some scalar slow path if so. This should be possible with vcmpltps followed by vmovmskps. However, I can't get clang to emit this code, even if I surround the call to sphere_hit with __builtin_expect(..., 0).


Solution

  • Indeed it is possible to convince clang to vectorize this code. With compiler options -Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize, clang claims that the floating point operations are vectorized, which is confirmed by the Godbolt output. (The red underlined fors are not errors, but vectorization reports).

    Vectorization is possible by storing the results of sphere_hit as chars to a temporary array hitx8. Afterwards, 8 sphere_hit results are tested per iteration by reading the 8 chars back from memory as one uint64_t a. This should be quite efficient since the condition a!=0 (see code below) is still rare since sphere hits are very rare. Moreover, array hitx8 is likely in L1 or L2 cache most of the time.

    I didn't test the code for correctness, but at least the auto-vectorization idea should work.

    /* clang -Ofast -Wall -march=broadwell -Rpass-analysis=loop-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize */
    #include<string.h>
    char sphere_hit(float x1, float y1, float z1, float r1,
            float x2, float y2, float z2, float r2);
    void handle_hit(int i, int j);
    
    void vectorized_code(float* __restrict xs, float* __restrict ys, float* __restrict zs, float* __restrict rs, char* __restrict hitx8, int n_spheres){
        unsigned long long int a;
        for (int i = 0; i < n_spheres; ++i) {
            for (int j = i + 1; j < n_spheres; ++j){
                /* Store the boolean results temporarily in char array hitx8.     */
                /* The indices of hitx8 are shifted by i+1, so the loop           */
                /* starts with hitx8[0]                                           */
                /* char array hitx8 should have n_spheres + 8 elements            */
                hitx8[j-i-1] = sphere_hit(xs[i], ys[i], zs[i], rs[i],
                        xs[j], ys[j], zs[j], rs[j]);
            }
            for (int j = n_spheres; j < n_spheres+8; ++j){
                /* Add 8 extra zeros at the end of hitx8.                   */
                hitx8[j-i-1] = 0;     /* hitx8 is 8 elements longer than xs */
            }
            for (int j = i + 1; j < n_spheres; j=j+8){
                memcpy(&a,&hitx8[j-i-1],8);
                /* Check 8 sphere hits in parallel:                                   */
                /* one `unsigned long long int a` contains 8 boolean values here      */ 
                /* The condition a!=0 is still rare since sphere hits are very rare.  */
                if (a!=0ull){ 
                    if (hitx8[j-i-1+0] != 0) handle_hit(i,j+0);
                    if (hitx8[j-i-1+1] != 0) handle_hit(i,j+1);
                    if (hitx8[j-i-1+2] != 0) handle_hit(i,j+2);
                    if (hitx8[j-i-1+3] != 0) handle_hit(i,j+3);
                    if (hitx8[j-i-1+4] != 0) handle_hit(i,j+4);
                    if (hitx8[j-i-1+5] != 0) handle_hit(i,j+5);
                    if (hitx8[j-i-1+6] != 0) handle_hit(i,j+6);
                    if (hitx8[j-i-1+7] != 0) handle_hit(i,j+7);
                }
            }
        }
    }
    
    
    inline char sphere_hit(float x1, float y1, float z1, float r1,
            float x2, float y2, float z2, float r2) {
        float xd = (x1 - x2);
        float yd = (y1 - y2);
        float zd = (z1 - z2);
    
        float max_dist = (r1 + r2);
    
        return xd * xd + yd * yd + zd * zd < max_dist * max_dist;
    }