I'm using benchmark.js to clock two versions of a function, one in JS and one in C++ (a node.js binding).
The C++ version is a for loop around a single compiler intrinsic (2-cycle latency, 0.5-cycle reciprocal throughput):
for (size_t i = 0; i < arrlen; i++) {
#if defined(_MSC_VER)
(*events)[i] = _byteswap_ushort((*events)[i]);
#elif defined(__GNUC__)
(*events)[i] = __builtin_bswap16((*events)[i]);
#endif
}
I expect it to be fast ... but it's clocking faster than my CPU frequency (4.0 GHz) — that is, it processes more elements per second than my CPU executes clock cycles. How can this be happening? (I have verified that the function produces correct results outside of the benchmark suite.)
native: 17,253,787,071 elements/sec (10k elements in array * 1,725,379 calls/sec)
JS: 846,298,297 elements/sec (10k elements in array * 84,630 calls/sec)
// both ~90 runs sampled
Hard to say exactly without more context, but probably one or more of the following:
The compiler is auto-vectorizing the loop with instructions such as PSHUFB, which byte-swap several elements per instruction. (PSHUFB itself is an SSSE3 instruction that operates on 128-bit registers, so it can swap 8 16-bit words at a time; the 256-bit AVX2 form, VPSHUFB, handles 16 words at a time.)
Superscalar, pipelined execution is letting the processor keep multiple loop iterations in flight at once — the 0.5-cycle reciprocal throughput you quoted already means two scalar swaps can retire per cycle.
There is a problem with your benchmark that is allowing the entire calculation to be optimized away. (Unlikely, but worth ruling out.)