Search code examples
cperformancedoublessequaternions

SSE with doubles, not worth it?


I read a bit about using SSE intrinsics and tried my luck with implementing quaternion rotation with doubles. Below are the normal and SSE functions I wrote,


void quat_rot(quat_t a, REAL* restrict b){
    ///////////////////////////////////////////
    // Multiply vector b by quaternion a     //
    ///////////////////////////////////////////
    REAL cross_temp[3],result[3];

    cross_temp[0]=a.el[2]*b[2]-a.el[3]*b[1]+a.el[0]*b[0];
    cross_temp[1]=a.el[3]*b[0]-a.el[1]*b[2]+a.el[0]*b[1];
    cross_temp[2]=a.el[1]*b[1]-a.el[2]*b[0]+a.el[0]*b[2];

    result[0]=b[0]+2.0*(a.el[2]*cross_temp[2]-a.el[3]*cross_temp[1]);
    result[1]=b[1]+2.0*(a.el[3]*cross_temp[0]-a.el[1]*cross_temp[2]);
    result[2]=b[2]+2.0*(a.el[1]*cross_temp[1]-a.el[2]*cross_temp[0]);

    b[0]=result[0];
    b[1]=result[1];
    b[2]=result[2];
}

With SSE


inline void cross_p(__m128d *a, __m128d *b, __m128d *c){
    const __m128d SIGN_NP = _mm_set_pd(0.0, -0.0);


    __m128d l1 = _mm_mul_pd( _mm_unpacklo_pd(a[1], a[1]), b[0] );

    __m128d l2 = _mm_mul_pd( _mm_unpacklo_pd(b[1], b[1]), a[0] );
    __m128d m1 = _mm_sub_pd(l1, l2);
            m1 = _mm_shuffle_pd(m1, m1, 1);
            m1 = _mm_xor_pd(m1, SIGN_NP);

            l1 = _mm_mul_pd( a[0], _mm_shuffle_pd(b[0], b[0], 1) );

    __m128d m2 = _mm_sub_sd(l1, _mm_unpackhi_pd(l1, l1));

    c[0] = m1;
    c[1] = m2;
}

void quat_rotSSE(quat_t a, REAL* restrict b){
    ///////////////////////////////////////////
    // Multiply vector b by quaternion a     //
    ///////////////////////////////////////////

    __m128d axb[2];
    __m128d aa[2];
            aa[0] = _mm_load_pd(a.el+1);
            aa[1] = _mm_load_sd(a.el+3);
    __m128d bb[2];
            bb[0] = _mm_load_pd(b);
            bb[1] = _mm_load_sd(b+2);

    cross_p(aa, bb, axb);

    __m128d w = _mm_set1_pd(a.el[0]);

    axb[0] = _mm_add_pd(axb[0], _mm_mul_pd(w, bb[0]));
    axb[1] = _mm_add_sd(axb[1], _mm_mul_sd(w, bb[1]));

    cross_p(aa, axb, axb);

    _mm_store_pd(b, _mm_add_pd(bb[0], _mm_add_pd(axb[0], axb[0])));
    _mm_store_sd(b+2, _mm_add_pd(bb[1], _mm_add_sd(axb[1], axb[1])));
}

The rotation is basically done using the function,

enter image description here

I then run the following test to check how much time each function takes to do a set of rotations,


int main(int argc, char *argv[]){

    REAL a[] __attribute__ ((aligned(16))) = {0.2, 1.3, 2.6};
    quat_t q = {{0.1, 0.7, -0.3, -3.2}};

    REAL sum = 0.0;
    for(int i = 0; i < 4; i++) sum += q.el[i] * q.el[i];
    sum = sqrt(sum);
    for(int i = 0; i < 4; i++) q.el[i] /= sum;

    int N = 1000000000;

    for(int i = 0; i < N; i++){
        quat_rotSSE(q, a);
    }

    printf("rot = ");
    for(int i = 0; i < 3; i++) printf("%f, ", a[i]);
    printf("\n");

    return 0;
}

I compiled using gcc 4.6.3 with -O3 -std=c99 -msse3.

The timings for the normal function, using the unix time, was 18.841s and 21.689s for the SSE one.

Am I missing something, why is the SSE implementation 15% slower than the normal one? In which cases would an SSE implementation be faster for double precision?


EDIT: Taking advice from the comments, I tried several things,

  • -O1 option gives very similar results.
  • Tried using restrict on the cross_p function and added an __m128d to hold the second cross product. This had no difference in the assembly produced.
  • The assembly produced for the normal function, from what I understand, only contains scalar instructions except of some movapd.

The assembly code generated for the SSE function is only 4 lines less than the normal one.


EDIT: Added links to the assembly generated,

quat_rot

quat_rotSSE


Solution

  • SSE (and SIMD in general) works really well when you're performing the same operations on a large number of elements, where there's no dependencies between operations. For example, if you had an array of double and needed to do array[i] = (array[i] * K + L)/M + N; for each element then SSE/SIMD would help.

    If you're not performing the same operations on a large number of elements, then SSE doesn't help. For example, if you had one double and needed to do foo = (foo * K + L)/M + N; then SSE/SIMD isn't going to help.

    Basically, SSE is the wrong tool for the job. You need to change the job into something where SSE is the right tool. For example, rather than multiplying one vector by one quaternion; try multiplying an array of 1000 vectors by a quaternion, or perhaps multiplying an array of 1000 vectors by an array of 1000 quaternions.

    EDIT: Added everything below here!

    Note that this typically means modifying data structures to suit. For example, rather than having an array of structures it's often better to have a structure of arrays.

    For a better example, imagine your code uses an array of quaternions, like this:

    for(i = 0; i < quaternionCount; i++) {
       cross_temp[i][0] = a[i][2] * b[i][2] - a[i][3] * b[i][1] + a[i][0] * b[i][0];
       cross_temp[i][1] = a[i][3] * b[i][0] - a[i][1] * b[i][2] + a[i][0] * b[i][1];
       cross_temp[i][2] = a[i][1] * b[i][1] - a[i][2] * b[i][0] + a[i][0] * b[i][2];
    
       b[i][0] = b[i][0] + 2.0 * (a[i][2] * cross_temp[i][2] - a[i][3] * cross_temp[i][1]);
       b[i][1] = b[i][1] + 2.0 * (a[i][3] * cross_temp[i][0] - a[i][1] * cross_temp[i][2]);
       b[i][2] = b[i][2] + 2.0 * (a[i][1] * cross_temp[i][1] - a[i][2] * cross_temp[i][0]);
    }
    

    The first step would be to transform it into a quaternion of arrays, and do this:

    for(i = 0; i < quaternionCount; i++) {
       cross_temp[0][i] = a[2][i] * b[2][i] - a[3][i] * b[1][i] + a[0][i] * b[0][i];
       cross_temp[1][i] = a[3][i] * b[0][i] - a[1][i] * b[2][i] + a[0][i] * b[1][i];
       cross_temp[2][i] = a[1][i] * b[1][i] - a[2][i] * b[0][i] + a[0][i] * b[2][i];
    
       b[0][i] = b[0][i] + 2.0 * (a[2][i] * cross_temp[2][i] - a[3][i] * cross_temp[1][i]);
       b[1][i] = b[1][i] + 2.0 * (a[3][i] * cross_temp[0][i] - a[1][i] * cross_temp[2][i]);
       b[2][i] = b[2][i] + 2.0 * (a[1][i] * cross_temp[1][i] - a[2][i] * cross_temp[0][i]);
    }
    

    Then, because 2 adjacent doubles fit into a single SSE register you want to unroll the loop by 2:

    for(i = 0; i < quaternionCount; i += 2) {
       cross_temp[0][i] = a[2][i] * b[2][i] - a[3][i] * b[1][i] + a[0][i] * b[0][i];
       cross_temp[0][i+1] = a[2][i+1] * b[2][i+1] - a[3][i+1] * b[1][i+1] + a[0][i+1] * b[0][i+1];
       cross_temp[1][i] = a[3][i] * b[0][i] - a[1][i] * b[2][i] + a[0][i] * b[1][i];
       cross_temp[1][i+1] = a[3][i+1] * b[0][i+1] - a[1][i+1] * b[2][i+1] + a[0][i+1] * b[1][i+1];
       cross_temp[2][i] = a[1][i] * b[1][i] - a[2][i] * b[0][i] + a[0][i] * b[2][i];
       cross_temp[2][i+1] = a[1][i+1] * b[1][i+1] - a[2][i+1] * b[0][i+1] + a[0][i+1] * b[2][i+1];
    
       b[0][i] = b[0][i] + 2.0 * (a[2][i] * cross_temp[2][i] - a[3][i] * cross_temp[1][i]);
       b[0][i+1] = b[0][i+1] + 2.0 * (a[2][i+1] * cross_temp[2][i+1] - a[3][i+1] * cross_temp[1][i+1]);
       b[1][i] = b[1][i] + 2.0 * (a[3][i] * cross_temp[0][i] - a[1][i] * cross_temp[2][i]);
       b[1][i+1] = b[1][i+1] + 2.0 * (a[3][i+1] * cross_temp[0][i+1] - a[1][i+1] * cross_temp[2][i+1]);
       b[2][i] = b[2][i] + 2.0 * (a[1][i] * cross_temp[1][i] - a[2][i] * cross_temp[0][i]);
       b[2][i+1] = b[2][i+1] + 2.0 * (a[1][i+1] * cross_temp[1][i+1] - a[2][i+1] * cross_temp[0][i+1]);
    }
    

    Now, you want to break this down into individual operations. For example, the first 2 lines of the inner loop would become:

       cross_temp[0][i] = a[2][i] * b[2][i];
       cross_temp[0][i] -= a[3][i] * b[1][i];
       cross_temp[0][i] += a[0][i] * b[0][i];
       cross_temp[0][i+1] = a[2][i+1] * b[2][i+1];
       cross_temp[0][i+1] -= a[3][i+1] * b[1][i+1];
       cross_temp[0][i+1] += a[0][i+1] * b[0][i+1];
    

    Now re-order that:

       cross_temp[0][i] = a[2][i] * b[2][i];
       cross_temp[0][i+1] = a[2][i+1] * b[2][i+1];
    
       cross_temp[0][i] -= a[3][i] * b[1][i];
       cross_temp[0][i+1] -= a[3][i+1] * b[1][i+1];
    
       cross_temp[0][i] += a[0][i] * b[0][i];
       cross_temp[0][i+1] += a[0][i+1] * b[0][i+1];
    

    Once you've done all of this, think about converting to SSE. The first 2 lines of code is one load (that loads both a[2][i] and a[2][i+1] into an SSE register) followed by one multiplication (and not 2 separate loads and 2 separate multiplications). Those 6 lines might become (pseudo-code):

       load SSE_register1 with both a[2][i] and a[2][i+1]
       multiply SSE_register1 with both b[2][i] and b[2][i+1]
    
       load SSE_register2 with both a[3][i] and a[3][i+1]
       multiply SSE_register2 with both b[1][i] and b[1][i+1]
    
       load SSE_register2 with both a[0][i] and a[0][i+1]
       multiply SSE_register2 with both b[0][i] and b[0][i+1]
    
       SE_register1 = SE_register1 - SE_register2
       SE_register1 = SE_register1 + SE_register3
    

    Each line of pseudo-code here is a single SSE instruction/intrinsic; and every SSE instruction/intrinsic is doing 2 operations in parallel.

    If every instruction does 2 operations in parallel, then (in theory) it could be up to twice as fast as the original "one operation per instruction" code.