Search code examples
c++simdintrinsicssse2exp

How do you process exp() with SSE2?


I'm making a code that essentially takes advantage of SSE2 on optimizing this code:

double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
    pC[sampleIndex] = exp((mMin + std::clamp(pA[sampleIndex] + pB[sampleIndex], 0.0, 1.0) * mRange) * ln2per12);
}

in this:

double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];

// SSE2
__m128d bound_lower = _mm_set1_pd(0.0);
__m128d bound_upper = _mm_set1_pd(1.0);
__m128d rangeLn2per12 = _mm_set1_pd(mRange * ln2per12);
__m128d minLn2per12 = _mm_set1_pd(mMin * ln2per12);

__m128d loaded_a = _mm_load_pd(pA);
__m128d loaded_b = _mm_load_pd(pB);
__m128d result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);

double *pCEnd = pC + roundintup8(blockSize);
for (; pC < pCEnd; pA += 8, pB += 8, pC += 8) {
    _mm_store_pd(pC, result);

    loaded_a = _mm_load_pd(pA + 2);
    loaded_b = _mm_load_pd(pB + 2);
    result = _mm_add_pd(loaded_a, loaded_b);
    result = _mm_max_pd(bound_lower, result);
    result = _mm_min_pd(bound_upper, result);
    result = _mm_mul_pd(rangeLn2per12, result);
    result = _mm_add_pd(minLn2per12, result);
    _mm_store_pd(pC + 2, result);

    loaded_a = _mm_load_pd(pA + 4);
    loaded_b = _mm_load_pd(pB + 4);
    result = _mm_add_pd(loaded_a, loaded_b);
    result = _mm_max_pd(bound_lower, result);
    result = _mm_min_pd(bound_upper, result);
    result = _mm_mul_pd(rangeLn2per12, result);
    result = _mm_add_pd(minLn2per12, result);
    _mm_store_pd(pC + 4, result);

    loaded_a = _mm_load_pd(pA + 6);
    loaded_b = _mm_load_pd(pB + 6);
    result = _mm_add_pd(loaded_a, loaded_b);
    result = _mm_max_pd(bound_lower, result);
    result = _mm_min_pd(bound_upper, result);
    result = _mm_mul_pd(rangeLn2per12, result);
    result = _mm_add_pd(minLn2per12, result);
    _mm_store_pd(pC + 6, result);

    loaded_a = _mm_load_pd(pA + 8);
    loaded_b = _mm_load_pd(pB + 8);
    result = _mm_add_pd(loaded_a, loaded_b);
    result = _mm_max_pd(bound_lower, result);
    result = _mm_min_pd(bound_upper, result);
    result = _mm_mul_pd(rangeLn2per12, result);
    result = _mm_add_pd(minLn2per12, result);
}

And I would say it works pretty well. BUT, can't find any exp function for SSE2, to complete the chain of operations.

Reading this, it seems I need to call standard exp() from library?

Really? Isn't this penalizing? Any other ways? Different builtin function?

I'm on MSVC, /arch:SSE2, /O2, producing 32-bit code.


Solution

  • There are several libraries that provide vectorized exponential, with more or less accuracy.

    • SVML, provided with the Intel compiler (it provides intrinsics as well, so if you have a licence, you can use them), has different level of precision (and speed)
    • you mentioned IPP, also from Intel, that also provide some functionality
    • MKL also provides some interface for this computation (for this one, fixing the ISA can be done through macros, for instance if you need reproducibility or precision)
    • fmath is another option, you can tear the code from the vectorized exp to integrate it inside your loop.

    From experience, all these are faster and more precise than a custom padde approximation (not even talking about the unstable Taylor expansion that would give you negative number VERY quickly).

    For SVML, IPP and MKL, I would check which one is better: calling from inside your loop or calling exp with one call for your full array (as the libraries could use AVX512 instead of just SSE2).