Tags: c++, assembly, micro-optimization

What's the most efficient way to swap 4 16-bit integers on a 64-bit processor?


I have four uint16_t variables named a, b, c, and d, and I'd like to swap them like this:

void swap4(uint16_t &a, uint16_t &b, uint16_t &c, uint16_t &d) {
    uint16_t temp = a;
    a = b;
    b = c;
    c = d;
    d = temp;
}

Is there anything I can do to speed up this procedure?


Solution

  • As noted, first make sure this is really a bottleneck: most compilers should generate efficient code for this (unless there's a possibility of aliasing between the arguments).

    If these 16-bit values happen to be stored contiguously in memory (e.g. as a four-element vector), then (a) make sure they're aligned on an 8-byte boundary, and (b) you can use your CPU's shuffle instruction, an optimization that your compiler might or might not recognize on its own. Before you go any further, check your compiler's assembly output; modern GCC with -O2 does in fact recognize this simplification automatically (https://godbolt.org/z/qo1jxnbds).

    If you really want to hand-roll it, GCC provides a __builtin_shuffle built-in function (not a macro) that is portable across the architectures GCC targets, though not across compilers; for your use case you could write

    typedef uint16_t quadword __attribute__ ((vector_size (8)));
    quadword input = {a, b, c, d};
    const quadword rotate_mask = {1, 2, 3, 0};
    quadword output = __builtin_shuffle (input, rotate_mask);
    

    (You probably don't want to write exactly that; instead, recast your data as an array of those quadword types. See the Compiler Explorer link above for an example.)
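A minimal sketch of that recasting, assuming the four values already sit contiguously in memory. The helper name rotate4 is hypothetical, and the Clang branch is an added assumption: Clang has no __builtin_shuffle, so the sketch falls back to its __builtin_shufflevector there. std::memcpy is used for the load and store to stay clear of strict-aliasing problems; compilers normally reduce it to a single 8-byte move.

```cpp
#include <cstdint>
#include <cstring>

// 8-byte vector holding four 16-bit lanes (GCC/Clang vector extension).
typedef uint16_t quadword __attribute__((vector_size(8)));

// Rotate four contiguous values left by one lane: {a, b, c, d} -> {b, c, d, a}.
// This matches the effect of swap4 from the question.
// `rotate4` is a hypothetical helper name, not part of any library.
inline void rotate4(uint16_t* p) {
    quadword v;
    std::memcpy(&v, p, sizeof v);            // load all four lanes at once
#if defined(__clang__)
    // Clang spelling: indices are compile-time constants.
    v = __builtin_shufflevector(v, v, 1, 2, 3, 0);
#else
    // GCC spelling: the mask is itself a vector of lane indices.
    const quadword rotate_mask = {1, 2, 3, 0};
    v = __builtin_shuffle(v, rotate_mask);
#endif
    std::memcpy(p, &v, sizeof v);            // store the rotated lanes back
}
```

Called on an array such as `uint16_t v[4] = {a, b, c, d};`, `rotate4(v)` leaves the array holding b, c, d, a. As with the scalar version, compare the generated assembly before assuming this is faster.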

    On x86 this built-in typically lowers to a shuffle instruction such as pshufw or pshufb; the exact choice depends on the compiler version and target flags. If you're not on GCC, or don't need portability, you can reach pshufw directly through the _mm_shuffle_pi16 intrinsic (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=shuffle&techs=MMX,SSE&ig_expand=6426). Most modern SIMD-capable architectures, including RISC ones such as ARM with NEON, offer something similar.
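For the intrinsic route, a rough sketch using _mm_shuffle_pi16 might look like the following. The helper name rotate4_x86 is hypothetical, and the scalar fallback for builds without SSE is an addition so the snippet compiles elsewhere. The selector _MM_SHUFFLE(0, 3, 2, 1) picks source lanes 1, 2, 3, 0 for destination lanes 0 through 3. Because _mm_shuffle_pi16 works on MMX registers, the sketch calls _mm_empty() before returning so that later x87 floating-point code is not disturbed.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__SSE__)
#include <xmmintrin.h>  // _mm_shuffle_pi16, _MM_SHUFFLE, __m64, _mm_empty
#endif

// Rotate four contiguous values left by one lane: {a, b, c, d} -> {b, c, d, a}.
// `rotate4_x86` is a hypothetical helper name, not part of any library.
inline void rotate4_x86(uint16_t* p) {
#if defined(__SSE__)
    __m64 v;
    std::memcpy(&v, p, sizeof v);                      // load four 16-bit lanes
    v = _mm_shuffle_pi16(v, _MM_SHUFFLE(0, 3, 2, 1));  // dst[i] = src[(i + 1) % 4]
    std::memcpy(p, &v, sizeof v);                      // store the result
    _mm_empty();                                       // clear MMX state for x87 code
#else
    // Scalar fallback for targets without SSE (assumption: same semantics).
    const uint16_t first = p[0];
    p[0] = p[1];
    p[1] = p[2];
    p[2] = p[3];
    p[3] = first;
#endif
}
```

If the data lives in wider SSE registers anyway, an SSE2 shuffle on the low four words is the more usual choice today, but that is a different intrinsic from the one named above.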