Using AVX/AVX2 intrinsics, I can gather sets of 8 values, either 1,2 or 4 byte integers, or 4 byte floats using:
_mm256_i32gather_epi32()
_mm256_i32gather_ps()
But currently, I have a case where I am loading data that was generated on an nvidia GPU and stored as FP16 values. How can I do vectorized loads of these values?
So far, I found the _mm256_cvtph_ps() intrinsic.
However, input for that intrinsic is a __m128i value, not a __m256i value.
Looking at the Intel Intrinsics Guide, I see no gather operations that store 8 values into an _mm128i register?
How can I gather FP16 values into the 8 lanes of a __m256 register? Is it possible to vector load them as 2-byte shorts into __m256i and then somehow reduce that to a __m128i value to be passed into the conversion intrinsic? If so, I haven't found intrinsics to do that.
UPDATE
I tried the cast as suggested by @peter-cordes but I am getting bogus results from that. Also, I don't understand how that could work?
My 2-byte int values are stored in __m256i as:
0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX
so how can I simply cast to __m128i where it needs to be tightly packed as
XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX
Will the cast do that?
My current code:
__fp16* fielddensity = ...
__m256i indices = ...
__m256i msk = _mm256_set1_epi32(0xffff);
__m256i d = _mm256_and_si256(_mm256_i32gather_epi32(fielddensity,indices,2), msk);
__m256 v = _mm256_cvtph_ps(_mm256_castsi256_si128(d));
But the result doesn't seem to be 8 properly formed values. I think every 2nd one is currently bogus for me?
There is indeed no gather instruction for 16bit values so you need to gather 32 bit values and ignore one half of them (and make sure that you don't accidentally read from invalid memory). Also, _mm256_cvtph_ps()
needs all input values in the lower 128 bit lane and unfortunately, there is no lane-crossing 16 bit shuffle (until AVX512).
However, assuming you have only finite input values, you could do some bit-twiddling (avoiding the _mm256_cvtph_ps()
). If you load a half precision value into the upper half of a 32 bit register you can do the following operations:
SEEEEEMM MMMMMMMM XXXXXXXX XXXXXXXX // input Sign, Exponent, Mantissa, X=garbage
Shift arithmetically to the right by 3 (this keeps the sign bit where it needs to be):
SSSSEEEE EMMMMMMM MMMXXXXX XXXXXXXX
Mask away excessive sign bits and garbage at the bottom (with 0b1000'11111'11111111111'0000000000000
)
S000EEEE EMMMMMMM MMM00000 00000000
This will be a valid single precision float but the exponent will be off by 112=127-15
(the difference between the biases), i.e. you need to multiply these values by 2**112
(this may be combined with any subsequent operation, you intend to do anyway later). Note that this will also convert sub-normal float16 values to the corresponding sub-normal float32 value (which are also off by a factor of 2**112
).
Untested intrinsic version:
__m256 gather_fp16(__fp16 const* fielddensity, __m256i indices){
// subtract 2 bytes from base address to load data into high parts:
int32_t const* base = (int32_t const*) ( fielddensity - 1);
// Gather 32bit values.
// Be aware that this reads two bytes before each desired value,
// i.e., make sure that reading fielddensitiy[-1] is ok!
__m256i d = _mm256_i32gather_epi32(base, indices, 2);
// shift exponent bits to the right place and mask away excessive bits:
d = _mm256_and_si256(_mm256_srai_epi32(d, 3), _mm256_set1_epi32(0x8fffe000));
// scale values to compensate bias difference (could be combined with subsequent operations ...)
__m256 two112 = _mm256_castsi256_ps(_mm256_set1_epi32(0x77800000)); // 2**112
__m256 f = _mm256_mul_ps(_mm256_castsi256_ps(d), two112);
return f;
}