To introduce myself to x86 intrinsics (and, to a lesser extent, cache friendliness) I explicitly vectorized a bit of code I use for RBF (radial basis function)-based grid deformation. Having found vsqrtpd to be the major bottleneck, I want to know if/how I can mask its latency further. This is the scalar computational kernel:
for(size_t i=0; i<nPt; ++i)
{
    double xi = X[i], yi = X[i+nPt], zi = X[i+2*nPt];

    for(size_t j=0; j<nCP; ++j)
    {
        // compute distance from i to j
        double d = sqrt(pow(xi-Xcp[ j ],2)+
                        pow(yi-Xcp[ j+nCP ],2)+
                        pow(zi-Xcp[j+2*nCP],2));
        // compute the RBF kernel coefficient
        double t = max(0.0,1.0-d);
        t = pow(t*t,2)*(1.0+4.0*d);
        // update coordinates
        for(size_t k=0; k<nDim; ++k) X[i+k*nPt] += t*Ucp[j+k*nCP];
    }
}
nPt is the number of target coordinates and it is much larger than nCP, the number of source coordinates/displacements. The latter fit in L3 cache, so the innermost loop is always over source points.
void deform(size_t nPt, size_t nCP, const double* Xcp, const double* Ucp, double* X)
{
    const size_t SIMDLEN = 4;
    // tile ("cache block") sizes
    const size_t TILEH = 512;
    const size_t TILEW = 256;
    // fill two registers with the constants we need
    __m256d vone  = _mm256_set1_pd(1.0),
            vfour = _mm256_set1_pd(4.0);
    // explicitly vectorized (multiple i's at a time) and blocked
    // outermost loop over sets of #TILEH points
    for(size_t i0=0; i0<nPt; i0+=TILEH)
    {
        // displacement buffer; due to tiling, coordinates cannot be modified in-place
        alignas(64) double U[3*TILEH];
        // zero the tile displacements
        for(size_t k=0; k<3*TILEH; k+=SIMDLEN)
            _mm256_store_pd(&U[k], _mm256_setzero_pd());
        // stop point for inner i loop
        size_t iend = min(i0+TILEH,nPt);
        // second loop over sets of #TILEW control points
        for(size_t j0=0; j0<nCP; j0+=TILEW)
        {
            // stop point for inner j loop
            size_t jend = min(j0+TILEW,nCP);
            // inner i loop, over #TILEH points
            // vectorized, operate on #SIMDLEN points at a time
            for(size_t i=i0; i<iend; i+=SIMDLEN)
            {
                // coordinates and displacements of points i
                __m256d wi,
                    xi = _mm256_load_pd(&X[ i ]),
                    yi = _mm256_load_pd(&X[ i+nPt ]),
                    zi = _mm256_load_pd(&X[i+2*nPt]),
                    ui = _mm256_load_pd(&U[ i-i0 ]),
                    vi = _mm256_load_pd(&U[ i-i0+TILEH ]);
                wi = _mm256_load_pd(&U[i-i0+2*TILEH]);
                // inner j loop, over #TILEW control points, vectorized loads
                for(size_t j=j0; j<jend; j+=SIMDLEN)
                {
                    // coordinates of points j, and an aux var
                    __m256d t,
                        xj = _mm256_load_pd(&Xcp[ j ]),
                        yj = _mm256_load_pd(&Xcp[ j+nCP ]),
                        zj = _mm256_load_pd(&Xcp[j+2*nCP]);
                    // compute the possible 4 distances from i to j...
                    #define COMPUTE_DIST(D) __m256d \
                        D = _mm256_sub_pd(xi,xj); D = _mm256_mul_pd(D,D); \
                        t = _mm256_sub_pd(yi,yj); D = _mm256_fmadd_pd(t,t,D); \
                        t = _mm256_sub_pd(zi,zj); D = _mm256_fmadd_pd(t,t,D); \
                        D = _mm256_sqrt_pd(D)
                    // ...by going through the different permutations
                    #define SHUFFLE(FUN,IMM8) \
                        xj = FUN(xj,xj,IMM8); \
                        yj = FUN(yj,yj,IMM8); \
                        zj = FUN(zj,zj,IMM8)

                    COMPUTE_DIST(d0);
                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    COMPUTE_DIST(d1);
                    SHUFFLE(_mm256_permute2f128_pd,1);
                    COMPUTE_DIST(d2);
                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    COMPUTE_DIST(d3);

                    // coordinate registers now hold the displacements
                    xj = _mm256_load_pd(&Ucp[ j ]);
                    yj = _mm256_load_pd(&Ucp[ j+nCP ]);
                    zj = _mm256_load_pd(&Ucp[j+2*nCP]);
                    // coefficients for each set of distances...
                    #define COMPUTE_COEFF(C) \
                        t = _mm256_min_pd(vone,C); t = _mm256_sub_pd(vone,t); \
                        t = _mm256_mul_pd(t,t); t = _mm256_mul_pd(t,t); \
                        C = _mm256_fmadd_pd(vfour,C,vone); \
                        C = _mm256_mul_pd(t,C)
                    // ...+ update i point displacements
                    #define UPDATE_DISP(C) \
                        COMPUTE_COEFF(C); \
                        ui = _mm256_fmadd_pd(C,xj,ui); \
                        vi = _mm256_fmadd_pd(C,yj,vi); \
                        wi = _mm256_fmadd_pd(C,zj,wi)

                    UPDATE_DISP(d0);
                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    UPDATE_DISP(d1);
                    SHUFFLE(_mm256_permute2f128_pd,1);
                    UPDATE_DISP(d2);
                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    UPDATE_DISP(d3);
                }
                // store updated displacements
                _mm256_store_pd(&U[ i-i0 ], ui);
                _mm256_store_pd(&U[ i-i0+TILEH ], vi);
                _mm256_store_pd(&U[i-i0+2*TILEH], wi);
            }
        }
        // add tile displacements to the coordinates
        for(size_t k=0; k<3; ++k)
        {
            for(size_t i=i0; i<iend; i+=SIMDLEN)
            {
                __m256d
                    x = _mm256_load_pd(&X[i+k*nPt]),
                    u = _mm256_load_pd(&U[i-i0+k*TILEH]);
                x = _mm256_add_pd(x,u);
                _mm256_stream_pd(&X[i+k*nPt], x);
            }
        }
    }
}
So what more can I do to it? Or, am I doing something very wrong?
Thank you, P. Gomes
First, check perf counters for arith.divider_active being ~= core clock cycles.
98% of the function runtime can be explained by taking the number of square roots and the operation throughput.
Or that works too.
If that's the case, you're saturating the (not fully pipelined) divider throughput and there's not much left to gain from just exposing more ILP.
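As a quick back-of-envelope check of that: with the AVX2 loop above, every (4-point, 4-control-point) block issues 4 vsqrtpd, i.e. one vsqrtpd per 4 point/control-point pairs, so at the Skylake throughput quoted further down the sqrt work alone is roughly

    cycles_sqrt ≈ (nPt * nCP / 4) * (9 to 12)

Comparing that against the measured core cycles (or against arith.divider_active) is the easiest way to confirm the divider really is saturated.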
Algorithmic changes are your only real chance to gain anything, e.g. avoid some sqrt operations or use single-precision.
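On the "avoid some sqrt" side, your kernel has compact support: max(0.0, 1.0-d) makes the coefficient exactly zero whenever the squared distance is already >= 1, so the square root of such a block is wasted work. A minimal sketch of the test (untested; lanes_in_support is a name I made up, and d2 stands for the squared distance that COMPUTE_DIST currently feeds to _mm256_sqrt_pd):

#include <immintrin.h>

// Returns a 4-bit mask of the lanes whose squared distance is < 1, i.e. the
// only lanes where max(0, 1-d) is nonzero. If the mask is 0, the caller can
// skip the vsqrtpd and the whole coefficient/update step for this block.
static inline int lanes_in_support(__m256d d2)
{
    __m256d in_range = _mm256_cmp_pd(d2, _mm256_set1_pd(1.0), _CMP_LT_OQ);
    return _mm256_movemask_pd(in_range);
}

Whether the branch pays off depends on how often a whole block of 4 distances is out of range, which in turn depends on how the control points are ordered relative to the targets; a branch that is almost always taken (or almost never) is cheap, an unpredictable one is not.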
Single-precision gives you 2x as much work per vector for free. But for sqrt-heavy workloads there's an additional gain: vsqrtps throughput per vector is usually better than vsqrtpd. That's the case on Skylake: one per 6 cycles vs. vsqrtpd at one per 9 to 12 cycles. That could move the bottleneck away from the sqrt/divide unit, perhaps to the front-end or the FMA unit.
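To give an idea of the mechanical part of that change, here is a sketch of the distance block in single precision (names are mine; it assumes you'd keep X/Xcp as float, which also halves the cache footprint):

#include <immintrin.h>

// Single-precision version of COMPUTE_DIST: 8 points per __m256 instead of 4.
// xi,yi,zi hold 8 target-point coordinates; xj,yj,zj hold one rotation of the
// control-point coordinates (the 8-lane analogue of the SHUFFLE step).
static inline __m256 distance8(__m256 xi, __m256 yi, __m256 zi,
                               __m256 xj, __m256 yj, __m256 zj)
{
    __m256 d = _mm256_sub_ps(xi, xj);
    d = _mm256_mul_ps(d, d);
    __m256 t = _mm256_sub_ps(yi, yj);
    d = _mm256_fmadd_ps(t, t, d);
    t = _mm256_sub_ps(zi, zj);
    d = _mm256_fmadd_ps(t, t, d);
    return _mm256_sqrt_ps(d);        // vsqrtps: ~1 per 6 cycles on Skylake
}

Note that the all-pairs shuffling would need reworking for 8 lanes (e.g. rotations with vpermps / _mm256_permutevar8x32_ps), so it's not a pure find-and-replace of pd with ps.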
vrsqrtps has been suggested in comments. That would be worth considering (if single-precision is an option), but it's not a clear win when you need a Newton-Raphson iteration to get enough precision. Bare x * rsqrtps(x) without Newton-Raphson is probably too inaccurate (and needs a cmp/AND to work around x==0.0), but an NR iteration can take too many extra FMA uops to be worth it.
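For concreteness, the usual shape of that alternative (a sketch only, single precision, with the textbook Newton-Raphson constants, not validated against your accuracy needs):

#include <immintrin.h>

// sqrt(x) via x * rsqrt(x) with one Newton-Raphson refinement.
// vrsqrtps gives ~12 bits; one NR step brings it close to full float precision.
// rsqrtps(0) = +inf propagates to NaN through the math, so the final AND with
// the (x > 0) mask forces those lanes back to 0.0f, matching exact sqrt(0).
static inline __m256 sqrt_via_rsqrt(__m256 x)
{
    __m256 y = _mm256_rsqrt_ps(x);
    // Newton-Raphson step: y = y * (1.5 - 0.5 * x * y * y)
    __m256 half_x = _mm256_mul_ps(_mm256_set1_ps(0.5f), x);
    y = _mm256_mul_ps(y, _mm256_fnmadd_ps(half_x, _mm256_mul_ps(y, y),
                                          _mm256_set1_ps(1.5f)));
    __m256 s = _mm256_mul_ps(x, y);                 // sqrt(x) ~= x * rsqrt(x)
    __m256 nonzero = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_GT_OQ);
    return _mm256_and_ps(s, nonzero);               // handle x == 0.0
}

That's on the order of 7 extra uops per vector compared to a single vsqrtps, which is exactly why it only wins if sqrt throughput is the one and only bottleneck.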
(AVX512 with vrsqrt14ps/pd has more precision in the approximation, but usually still not enough to use without Newton. But interestingly it does exist for double-precision. Of course, if you're on Xeon Phi, sqrt is very slow and you're intended to use AVX512ER vrsqrt28pd + Newton, or just vrsqrt28ps on its own.)
Last time I tuned a function including a sqrt of a polynomial approximation for Skylake, fast-approx reciprocals weren't worth it. Hardware single-precision sqrt was the best choice that gave us the required precision (and we weren't even considering needing double). There was more work than yours between sqrt operations, though.