I am looking for a way to saturate packed 64-bit integers down to 8-bit integers. I looked at _mm256_cvtepi64_epi8, but instead of saturating it truncates, which gives unwanted output.
My program is as below:
#include <immintrin.h>

int main()
{
    __m256i a, b, c;
    __m128i d;
    a = _mm256_set1_epi64x(127);
    b = _mm256_set1_epi64x(1);
    c = _mm256_add_epi64(a, b);    // _mm256_add_epi64; there is no _mm256_add_epi64x
    d = _mm256_cvtepi64_epi8(c);
}
I expect the output (d) to contain four 127 elements (saturated); however, the program yields four -128 elements (truncated from 128).
_mm256_cvtepi64_epi8 is AVX512 (specifically AVX512VL; the 512-bit version is AVX512F). You tagged that, but your (original) title only said AVX.
Anyway, your options include doing saturated addition in the first place with _mm256_adds_epi8, so you can have 8x as many elements per vector.
(And as discussed in comments, for an 8x8 => 8-bit saturating multiply, you might just want in-lane unpack to feed _mm256_mullo_epi16, and pack pairs of results back down with in-lane _mm256_packs_epi16 (vpacksswb). Although sign-extending in-lane unpack is not convenient, so you might consider vpmovsx. Either way, you definitely don't need to widen beyond 16-bit elements; int16_t can hold the full product of two int8_t without overflow.)
Or to do it the way you asked: AVX512 does have signed- and unsigned-saturation versions of the down-convert instructions, alongside the truncation version you found. VPMOVQB, VPMOVSQB, and VPMOVUSQB are all documented together.
__m128i _mm256_cvtsepi64_epi8(__m256i a); does signed saturation. It's also available in a version with an __m512i source, and in a version that stores to memory directly (optionally as a masked store). (The store version is no more efficient on mainstream CPUs, but it did allow KNL / KNM, which lack AVX512BW, to do narrow byte-masked stores.)
Do not widen your data to 64-bit elements unless you have to. With 64-bit elements you only fit 1/8th as many elements per vector as with 8-bit elements, and 32x32 => 32-bit and 64x64 => 64-bit SIMD multiplies cost 2 uops per instruction on Intel since Haswell.
Another option is to pack 2 vectors -> 1 vector of the same width as the 2 inputs, but the pack instructions only work in-lane, e.g. _mm256_packs_epi16 as mentioned above. They're only available for 2:1 element-size ratios, not all the way from 64 or 32 down to 8 in one step. (So that's yet another reason to avoid widening too much.)
But if you count the total number of shuffles needed to produce N bytes of output data, 2:1 packing tends to come out slightly ahead. e.g. for 4 input vectors, you need 2 + 1 pack shuffles instead of 4 vpmovdb to narrow from 32-bit to 8-bit elements. (Plus maybe a 4th lane-crossing shuffle to fix up the in-lane ordering, if you weren't able to feed the pack instructions with data already interleaved odd/even across 128-bit lanes.) You have to look at the big picture of how many shuffles (or potentially other instructions, like AND or AVX512 byte-masking) it takes to unpack as well as re-pack.
2:1 packing also produces full-width vectors, which means wider stores if you're storing the results. And if you're not storing, that's an even bigger advantage over the new AVX512 1->1 vector narrowing instructions, where you'd need extra shuffles to recombine the narrow results into one 256-bit vector.