c++ bit-manipulation simd avx2 branchless

Transform random integers into range [min,max] without branching

I got hold of a SUPER-FAST algorithm that generates an array of uniformly distributed random bytes. It's 6 times faster than the C++ standard library's uniform distribution and Mersenne Twister.

The size of the array is divisible by 4, so it can be interpreted as an array of integers. Casting each entry to an integer produces values in the range [INT_MIN, INT_MAX]. But how can I transform these integer values to lie between my own [min, maximum]?

I want to avoid any if-else, to avoid branching.

Maybe I should apply some bitwise logic to discard the irrelevant bits in each number? (All remaining, unmasked bits will be either 0 or 1 anyway.) If I can find the most significant set bit of my maximum value, I could mask off any bits that are more significant than that one in my integers.

For example, if I want my max to be 17, that is 00010001 in binary. Maybe my mask would then look like 00011111? I could then apply it to all numbers in my array.

But this mask is wrong... It actually allows values up to 31 (1+2+4+8+16) :(

What can I do? Also, how do I take care of the min?

Edit

I am generating millions of numbers every frame of my application, for neural networks. I managed to vectorize the code using AVX2 for the float variants (using this post), but I need to get integers working too.

Solution

But how can I transform these integer values to lie between my own [min, maximum]?

Since the range may not be a power of two, bitmasking is out, but you found that out already.

Modulo is also out: it does not exist as a native operation in AVX2 (and even if it did, that wouldn't necessarily make it efficient).

There is another option: multiply-high, using _mm256_mul_epu32 (unfortunately there is no "pure" multiply-high for 32-bit numbers like there is for 16-bit numbers, so we're stuck with an operation that only does 50% useful work). The idea is to take the input number x (full range, treated as unsigned) and the size of the desired range r, then compute r * x / 2^32, where the division is implicit (implemented by taking the high half of the product).

If x / 2^32 were interpreted as a rational number, it would lie in [0.0 .. 1.0) (excluding 1.0); multiplying by r then stretches the range to [0.0 .. r) (excluding r). That's not how it's calculated, but that's where the formula comes from.

Setting the minimum of the range is handled easily by adding min to the scaled result.

In code (slightly tested):

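#include <immintrin.h>

// Maps each unsigned 32-bit lane of x to [min, max); max is excluded.
__m256i squish(__m256i x, int min, int max) {
    __m256i sizeOfRange = _mm256_set1_epi32((unsigned)max - min);
    // Products of the even lanes; the shuffle moves each high half into the even lane.
    __m256i scaled_even = _mm256_shuffle_epi32(_mm256_mul_epu32(x, sizeOfRange), 0xB1);
    // Move the odd lanes into even position and multiply; the high halves land in the odd lanes.
    __m256i scaled_odd = _mm256_mul_epu32(_mm256_shuffle_epi32(x, 0xB1), sizeOfRange);
    // Take the even lanes from scaled_even and the odd lanes from scaled_odd.
    __m256i scaled = _mm256_blend_epi32(scaled_even, scaled_odd, 0xAA);
    return _mm256_add_epi32(scaled, _mm256_set1_epi32(min));
}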
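
For checking the vectorized function against something simpler, here is a minimal scalar sketch of what each 32-bit lane should compute. The name scale_to_range is made up for this illustration, and the sketch assumes min < max and a compiler where converting an out-of-range unsigned value to int32_t wraps (guaranteed since C++20, and what mainstream compilers do anyway).

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference for one lane: maps a uniformly random x in [0, 2^32)
// to an integer in [min, max), with max excluded.
// scale_to_range is a hypothetical helper, not part of the answer's code.
inline std::int32_t scale_to_range(std::uint32_t x, std::int32_t min, std::int32_t max) {
    assert(min < max);
    // Size of the range, computed in unsigned arithmetic so that
    // max - min cannot overflow.
    const std::uint32_t r =
        static_cast<std::uint32_t>(max) - static_cast<std::uint32_t>(min);
    // Full 64-bit product; its high 32 bits equal floor(r * x / 2^32).
    const std::uint64_t product = static_cast<std::uint64_t>(x) * r;
    const std::uint32_t scaled = static_cast<std::uint32_t>(product >> 32);
    // Shift by min with wrapping unsigned addition, then reinterpret as signed.
    return static_cast<std::int32_t>(scaled + static_cast<std::uint32_t>(min));
}
```

Feeding the same inputs to both versions and comparing lane by lane is a cheap way to test the shuffle and blend constants, for example with x = 0 (which should give min) and x = 0xFFFFFFFF (which should give max - 1).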

It's still an exclusive range: max itself is never produced, so it cannot handle the full [INT_MIN .. INT_MAX] as output range. There is no way to even specify it; the most it can do is [INT_MIN .. INT_MAX) (or, for example, an equivalent range with zero offset: [0 .. -1) ).
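
If an inclusive [min, max] is needed in scalar code, one possible sketch is to hold the range size in 64 bits, so that even the full range (size 2^32) is representable. The name scale_to_range_inclusive is hypothetical, and this is not a drop-in change for the AVX2 version: _mm256_mul_epu32 only multiplies 32-bit operands, so a range size of 2^32 does not fit there.

```cpp
#include <cassert>
#include <cstdint>

// Maps a uniformly random x in [0, 2^32) to an integer in [min, max],
// with max included. Hypothetical helper for illustration only.
inline std::int32_t scale_to_range_inclusive(std::uint32_t x, std::int32_t min, std::int32_t max) {
    assert(min <= max);
    // Number of values in [min, max]. The full range needs 2^32,
    // which is why this is a 64-bit quantity.
    const std::uint64_t r =
        static_cast<std::uint64_t>(static_cast<std::uint32_t>(max) -
                                   static_cast<std::uint32_t>(min)) + 1;
    // x < 2^32 and r <= 2^32, so the product stays below 2^64.
    const std::uint32_t scaled = static_cast<std::uint32_t>((x * r) >> 32);
    // Wrapping unsigned addition of min, then reinterpret as signed
    // (modular conversion is guaranteed since C++20).
    return static_cast<std::int32_t>(scaled + static_cast<std::uint32_t>(min));
}
```

For the full range the multiplication degenerates to the identity on the bits of x, so the result is simply x offset by INT32_MIN.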

It's also not really uniform, for the same reason that the simple modulo-based range reduction isn't really uniform: you just cannot fairly divide N marbles over K bins unless K happens to divide N evenly.
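
The marble argument is easy to see on a scaled-down model. The sketch below uses 8-bit inputs instead of 32-bit ones (an assumption made purely so that all inputs can be enumerated) and counts how many of the 256 possible values land in each bin under the same multiply-high mapping. The function name count_bins_8bit is made up for this illustration.

```cpp
#include <cassert>
#include <vector>

// For every 8-bit input x, computes the bin floor(r * x / 256) and
// returns how many inputs fall into each of the r bins.
// Hypothetical helper; the real code works on 32-bit inputs.
inline std::vector<int> count_bins_8bit(unsigned r) {
    assert(r >= 1 && r <= 256);
    std::vector<int> counts(r, 0);
    for (unsigned x = 0; x < 256; ++x) {
        ++counts[(x * r) >> 8];  // high byte of the 16-bit product
    }
    return counts;
}
```

With r = 3 the bins receive 86, 85 and 85 inputs, while r = 4 gives exactly 64 each. The 32-bit case behaves the same way: bin sizes differ by at most one input out of roughly 2^32 / r, so the bias is tiny for small ranges and grows as r approaches 2^32.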