I am looking for a way to saturate packed 64-bit integers down to 8-bit integers. I looked at _mm256_cvtepi64_epi8, but instead of saturating it truncates, which gives unwanted output.
My program is as below:
#include <immintrin.h>

int main()
{
    __m256i a, b, c;
    __m128i d;
    a = _mm256_set1_epi64x(127);
    b = _mm256_set1_epi64x(1);
    c = _mm256_add_epi64(a, b);    // _mm256_add_epi64; there is no _mm256_add_epi64x
    d = _mm256_cvtepi64_epi8(c);
}
I expect the output (d) to contain four 127 elements (saturated); however, the program yields four -128 elements (truncated from 128).
_mm256_cvtepi64_epi8 is AVX512 (specifically AVX512VL; the 512-bit version is AVX512F). You tagged that, but your (original) title only said AVX.
Anyway, your options include doing saturated addition in the first place with _mm256_adds_epi8, so you can have 8x as many elements per vector.
(And as discussed in comments, for an 8x8 => 8-bit saturating multiply, you might just want in-lane unpack to feed _mm256_mullo_epi16, and pack pairs of results back down with in-lane _mm256_packs_epi16 (vpacksswb). Although sign-extending in-lane unpack is not convenient, so you might consider vpmovsx. Either way, you definitely don't need to widen beyond 16-bit elements; int16_t can hold the full product of two int8_t without overflow.)
Or to do it the way you asked: AVX512 does have signed- and unsigned-saturation versions of the down-convert instructions, alongside the truncation version you found. VPMOVQB, VPMOVSQB, and VPMOVUSQB are all documented together.
__m128i _mm256_cvtsepi64_epi8(__m256i a); does signed saturation. It's also available in a version with an __m512i source, and in a version that stores to memory directly (optionally as a masked store). (The store version is no more efficient on mainstream CPUs, but it did allow KNL / KNM, which lack AVX512BW, to do narrow byte-masked stores.)
Do not widen your data to 64-bit elements unless you have to. With 64-bit elements you only fit 1/8th as many elements per vector as with 8-bit elements, and 32x32 => 32-bit and 64x64 => 64-bit SIMD multiplies cost 2 uops per instruction on Intel since Haswell.
Another option is to pack 2 vectors -> 1 vector of the same width as the 2 inputs, but the pack instructions only work in-lane, e.g. _mm256_packs_epi16 as mentioned above. They're only available for 2:1 element-size ratios, not all the way from 64 or 32 down to 8 in one step. (So that's yet another reason to avoid widening too much.)
But if you count the total number of shuffles needed to produce N bytes of output data, 2:1 packing tends to come out slightly ahead. e.g. for 4 input vectors, you need 2 + 1 pack shuffles instead of 4 vpmovdb to narrow from 32-bit to 8-bit elements. (Plus maybe a 4th lane-crossing shuffle to fix up the in-lane ordering, if you weren't able to feed the pack instructions with data already interleaved odd/even across 128-bit lanes.) You have to look at the big picture of how many shuffles (or potentially other instructions, like AND or AVX512 byte-masking) it takes to unpack as well as re-pack.
2:1 packing also produces full-width vectors, which means wider stores if you're storing the results. And if you're not storing, that's an even bigger advantage over the new AVX512 1->1 vector narrowing instructions, where you'd need extra shuffles to recombine the narrow results into one 256-bit vector.