Tags: c++, optimization, x86, intrinsics, avx

How to maximise instruction-level parallelism of a sqrt-heavy loop on the Skylake architecture?


To introduce myself to x86 intrinsics (and cache friendliness, to a lesser extent) I explicitly vectorized a bit of code I use for RBF (radial basis function) based grid deformation. Having found vsqrtpd to be the major bottleneck, I want to know if/how I can mask its latency further. This is the scalar computational kernel:

for(size_t i=0; i<nPt; ++i)
{
    double xi = X[i], yi = X[i+nPt], zi = X[i+2*nPt];

    for(size_t j=0; j<nCP; ++j)
    {
        // compute distance from i to j
        double d = sqrt(pow(xi-Xcp[   j   ],2)+
                        pow(yi-Xcp[ j+nCP ],2)+
                        pow(zi-Xcp[j+2*nCP],2));

        // compute the RBF kernel coefficient
        double t = max(0.0,1.0-d);
        t = pow(t*t,2)*(1.0+4.0*d);

        // update coordinates
        for(size_t k=0; k<nDim; ++k) X[i+k*nPt] += t*Ucp[j+k*nCP]; // nDim == 3
    }
}

nPt is the number of target coordinates and it is much larger than nCP, the number of source coordinates/displacements. The latter fit in L3, so the inner-most loop is always over source points.

  • First optimization step was to work on 4 target points simultaneously; source point data was still accessed via scalar loads followed by broadcasts (see the sketch after this list).
  • Second step was to target L1 by blocking the loops; blocking the i-loop was somehow much more important than blocking the j-loop, which gave only a marginal improvement. The inner-most loop is still over j to reduce loads/stores.
  • Third was to load 4 control points at a time and use shuffle/permute to go over the 4 i-j combinations instead of using broadcasts.
  • Fourth, after observing that omitting the square root gives a 1.5x speed-up (to about 70% of the FP performance of a large LLT on an i7-7700), was to dedicate 4 registers to the computation of the 4 square roots to (maybe?) allow some other computation to overlap with them... 1% improvement vs the third step.
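
Roughly, that first step looked like the following (a simplified sketch rather than the exact code, with an illustrative function name; it assumes nPt is a multiple of 4 and uses unaligned loads to avoid alignment requirements on X):

#include <cstddef>
#include <immintrin.h>

void deform_broadcast(size_t nPt, size_t nCP, const double* Xcp, const double* Ucp, double* X)
{
    __m256d vone  = _mm256_set1_pd(1.0),
            vfour = _mm256_set1_pd(4.0);

    // 4 target points at a time
    for(size_t i=0; i<nPt; i+=4)
    {
        __m256d xi = _mm256_loadu_pd(&X[   i   ]),
                yi = _mm256_loadu_pd(&X[ i+nPt ]),
                zi = _mm256_loadu_pd(&X[i+2*nPt]);

        for(size_t j=0; j<nCP; ++j)
        {
            // scalar load + broadcast of source point j, then squared distance
            __m256d t, d = _mm256_sub_pd(xi, _mm256_set1_pd(Xcp[   j   ]));
            d = _mm256_mul_pd(d,d);
            t = _mm256_sub_pd(yi, _mm256_set1_pd(Xcp[ j+nCP ]));  d = _mm256_fmadd_pd(t,t,d);
            t = _mm256_sub_pd(zi, _mm256_set1_pd(Xcp[j+2*nCP]));  d = _mm256_fmadd_pd(t,t,d);
            d = _mm256_sqrt_pd(d);

            // RBF kernel coefficient, same math as the scalar code
            t = _mm256_sub_pd(vone, _mm256_min_pd(vone,d));
            t = _mm256_mul_pd(t,t);  t = _mm256_mul_pd(t,t);
            t = _mm256_mul_pd(t, _mm256_fmadd_pd(vfour,d,vone));

            // update the 3 coordinates of the 4 points
            for(size_t k=0; k<3; ++k)
            {
                __m256d x = _mm256_loadu_pd(&X[i+k*nPt]);
                x = _mm256_fmadd_pd(t, _mm256_set1_pd(Ucp[j+k*nCP]), x);
                _mm256_storeu_pd(&X[i+k*nPt], x);
            }
        }
    }
}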

Current code

void deform(size_t nPt, size_t nCP, const double* Xcp, const double* Ucp, double* X)
{
    const size_t SIMDLEN = 4;

    // tile ("cache block") sizes
    const size_t TILEH = 512;
    const size_t TILEW = 256;

    // fill two registers with the constants we need
    __m256d vone  = _mm256_set1_pd(1.0),
            vfour = _mm256_set1_pd(4.0);

    // explicitly vectorized (multiple i's at a time) and blocked
    // outer most loop over sets of #TILEH points
    for(size_t i0=0; i0<nPt; i0+=TILEH)
    {
        // displacement buffer, due to tiling, coordinates cannot be modified in-place
        alignas(64) double U[3*TILEH];

        // zero the tile displacements
        for(size_t k=0; k<3*TILEH; k+=SIMDLEN)
            _mm256_store_pd(&U[k], _mm256_setzero_pd());

        // stop point for inner i loop
        size_t iend = min(i0+TILEH,nPt);

        // second loop over sets of #TILEW control points
        for(size_t j0=0; j0<nCP; j0+=TILEW)
        {
            // stop point for inner j loop
            size_t jend = min(j0+TILEW,nCP);

            // inner i loop, over #TILEH points
            // vectorized, operate on #SIMDLEN points at a time
            for(size_t i=i0; i<iend; i+=SIMDLEN)
            {
                // coordinates and displacements of points i
                __m256d
                xi = _mm256_load_pd(&X[   i   ]),
                yi = _mm256_load_pd(&X[ i+nPt ]),
                zi = _mm256_load_pd(&X[i+2*nPt]),
                ui = _mm256_load_pd(&U[    i-i0    ]),
                vi = _mm256_load_pd(&U[ i-i0+TILEH ]),
                wi = _mm256_load_pd(&U[i-i0+2*TILEH]);

                // inner j loop, over #TILEW control points, vectorized loads
                for(size_t j=j0; j<jend; j+=SIMDLEN)
                {
                    // coordinates of points j, and an aux var
                    __m256d t,
                    xj = _mm256_load_pd(&Xcp[   j   ]),
                    yj = _mm256_load_pd(&Xcp[ j+nCP ]),
                    zj = _mm256_load_pd(&Xcp[j+2*nCP]);

                    // compute the possible 4 distances from i to j...
                    #define COMPUTE_DIST(D) __m256d                         \
                    D = _mm256_sub_pd(xi,xj);  D = _mm256_mul_pd(D,D);      \
                    t = _mm256_sub_pd(yi,yj);  D = _mm256_fmadd_pd(t,t,D);  \
                    t = _mm256_sub_pd(zi,zj);  D = _mm256_fmadd_pd(t,t,D);  \
                    D = _mm256_sqrt_pd(D)

                    // ...by going through the different permutations
                    #define SHUFFLE(FUN,IMM8)   \
                    xj = FUN(xj,xj,IMM8);       \
                    yj = FUN(yj,yj,IMM8);       \
                    zj = FUN(zj,zj,IMM8)

                    COMPUTE_DIST(d0);

                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    COMPUTE_DIST(d1);

                    SHUFFLE(_mm256_permute2f128_pd,1);
                    COMPUTE_DIST(d2);

                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    COMPUTE_DIST(d3);

                    // coordinate registers now hold the displacements
                    xj = _mm256_load_pd(&Ucp[   j   ]);
                    yj = _mm256_load_pd(&Ucp[ j+nCP ]);
                    zj = _mm256_load_pd(&Ucp[j+2*nCP]);

                    // coefficients for each set of distances...
                    #define COMPUTE_COEFF(C)                                \
                    t = _mm256_min_pd(vone,C);  t = _mm256_sub_pd(vone,t);  \
                    t = _mm256_mul_pd(t,t);     t = _mm256_mul_pd(t,t);     \
                    C = _mm256_fmadd_pd(vfour,C,vone);                      \
                    C = _mm256_mul_pd(t,C)

                    // ...+ update i point displacements
                    #define UPDATE_DISP(C)          \
                    COMPUTE_COEFF(C);               \
                    ui = _mm256_fmadd_pd(C,xj,ui);  \
                    vi = _mm256_fmadd_pd(C,yj,vi);  \
                    wi = _mm256_fmadd_pd(C,zj,wi)

                    UPDATE_DISP(d0);

                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    UPDATE_DISP(d1);

                    SHUFFLE(_mm256_permute2f128_pd,1);
                    UPDATE_DISP(d2);

                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    UPDATE_DISP(d3);
                }

                // store updated displacements
                _mm256_store_pd(&U[    i-i0    ], ui);
                _mm256_store_pd(&U[ i-i0+TILEH ], vi);
                _mm256_store_pd(&U[i-i0+2*TILEH], wi);
            }
        }

        // add tile displacements to the coordinates
        for(size_t k=0; k<3; ++k)
        {
            for(size_t i=i0; i<iend; i+=SIMDLEN)
            {
                __m256d
                x = _mm256_load_pd(&X[i+k*nPt]),
                u = _mm256_load_pd(&U[i-i0+k*TILEH]);
                x = _mm256_add_pd(x,u);
                _mm256_stream_pd(&X[i+k*nPt], x);
            }
        }
    }
}

So what more can I do to it? Or, am I doing something very wrong?

Thank you, P. Gomes


Solution

  • First check perf counters for arith.divider_active being ~= core clock cycles.

    Your observation that 98% of the function runtime can be explained by the number of square roots and the sqrt throughput works as confirmation, too.

    If that's the case, you're saturating the (not fully pipelined) divider throughput and there's not much left to gain from just exposing more ILP.


    Algorithmic changes are your only real chance to gain anything, e.g. avoiding some sqrt operations or using single precision.

    Single-precision gives you 2x as much work per vector for free. But for sqrt-heavy workloads there's an additional gain: vsqrtps throughput per vector is usually better than vsqrtpd. That's the case on Skylake: one per 6 cycles vs. vsqrtpd at one per 9 to 12 cycles. That could move the bottleneck away from the sqrt/divide unit, perhaps to the front-end or the FMA unit.
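
    For illustration, a sketch (assuming single precision is acceptable for your deformation): the distance part of the kernel keeps the same structure as COMPUTE_DIST above but handles 8 points per vector.

    #include <immintrin.h>

    // sketch: single-precision distance between 8 pairs of points
    static inline __m256 dist8(__m256 xi, __m256 yi, __m256 zi,
                               __m256 xj, __m256 yj, __m256 zj)
    {
        __m256 d = _mm256_sub_ps(xi,xj);
        d = _mm256_mul_ps(d,d);
        __m256 t = _mm256_sub_ps(yi,yj);  d = _mm256_fmadd_ps(t,t,d);
        t = _mm256_sub_ps(zi,zj);         d = _mm256_fmadd_ps(t,t,d);
        return _mm256_sqrt_ps(d);         // vsqrtps: about one per 6 cycles on Skylake
    }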

    vrsqrtps has been suggested in comments. That would be worth considering (if single precision is an option), but it's not a clear win when you need a Newton-Raphson iteration to get enough precision. Bare x * rsqrtps(x) without Newton-Raphson is probably too inaccurate (and needs a cmp/AND to work around x==0.0), but an NR iteration can take too many extra FMA uops to be worth it.
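
    To make that trade-off concrete, here is a rough sketch (again assuming single precision) of x * rsqrtps(x) with one Newton-Raphson step and the x == 0.0 fix-up:

    #include <immintrin.h>

    // sketch: approximate sqrt via vrsqrtps + one Newton-Raphson iteration;
    // the cmp/AND zeroes lanes where x == 0, which would otherwise give 0 * inf = NaN
    static inline __m256 sqrt_via_rsqrt(__m256 x)
    {
        const __m256 half  = _mm256_set1_ps(0.5f);
        const __m256 three = _mm256_set1_ps(3.0f);

        __m256 y   = _mm256_rsqrt_ps(x);                     // ~12-bit estimate of 1/sqrt(x)
        __m256 xyy = _mm256_mul_ps(_mm256_mul_ps(x,y), y);   // x*y*y, close to 1
        y = _mm256_mul_ps(_mm256_mul_ps(half,y),
                          _mm256_sub_ps(three,xyy));         // y *= 0.5*(3 - x*y*y)

        __m256 s       = _mm256_mul_ps(x,y);                 // sqrt(x) ~= x * (1/sqrt(x))
        __m256 nonzero = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_NEQ_OQ);
        return _mm256_and_ps(s, nonzero);                    // zero where x == 0
    }

    Counting those extra multiplies, the subtract and the compare/AND against a single vsqrtps shows why this route is not a clear win.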

    (AVX512 with vrsqrt14ps/pd has more precision in the approximation, but usually still not enough to use without Newton. But interestingly it does exist for double-precision. Of course if you're on Xeon Phi, sqrt is very slow and you're intended to use AVX512ER vrsqrt28pd + Newton, or just vrsqrt28ps on its own.)

    Last time I tuned a function including a sqrt of a polynomial approximation for Skylake, fast approximate reciprocals weren't worth it. Hardware single-precision sqrt was the best choice that gave us the required precision (and we weren't even considering needing double). There was more work than yours between sqrt operations, though.