c++ simd avx2 avx512

SIMD: implement _mm256_max_epu64_ and _mm256_min_epu64_


I want to ask a question about SIMD. My CPU doesn't support AVX-512, but I would still like to have a _mm256_max_epu64.

How can we implement this function with AVX2?

Here is my trivial attempt. Maybe we can turn this into a discussion and improve it.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

#define SIMD_INLINE inline __attribute__ ((always_inline))

// Element-wise unsigned 64-bit maximum, done with scalar code:
// view each __m256i as four uint64_t lanes and keep the larger value per lane.
SIMD_INLINE __m256i __my_mm256_max_epu64_(__m256i a, __m256i b) {
  uint64_t *val_a = (uint64_t*) &a;
  uint64_t *val_b = (uint64_t*) &b;
  uint64_t e[4];
  for (size_t i = 0; i < 4; ++i) e[i] = (val_a[i] > val_b[i]) ? val_a[i] : val_b[i];
  return _mm256_set_epi64x(e[3], e[2], e[1], e[0]);
}
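
A minimal usage sketch (not from the original post), assuming an AVX2-enabled build (e.g. compiled with -mavx2); note that _mm256_set_epi64x takes the highest lane first:

#include <cstdio>

int main() {
  __m256i a = _mm256_set_epi64x(40, 30, 20, 10);  // lanes 3..0
  __m256i b = _mm256_set_epi64x(15, 35, 25,  5);
  uint64_t out[4];
  _mm256_storeu_si256((__m256i*)out, __my_mm256_max_epu64_(a, b));
  for (int i = 0; i < 4; ++i) printf("%llu ", (unsigned long long)out[i]);  // prints: 10 25 35 40
}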

EDIT as a Summary:

We had a discussion about unsigned comparison of __m256i values. I gave my trivial implementation above, following the very basic idea that a single __m256i is simply equivalent to 4 uint64_t (or 4 double), which together also make up 256 bits.

Then we got the answer from @chtz, which makes more sense from an AVX point of view, since it stays in vector registers and uses AVX bit-manipulation intrinsics.

In the end it turns out that these two implementations compile to essentially the same assembly, thanks to Clang. See the assembly example on Compiler Explorer.


I also added a _mm256_min_epu64_ below. It simply mirrors the _mm256_max_epu64_ above; including it here makes it easier to find in future searches.

// Element-wise unsigned 64-bit minimum: the mirror image of the max above,
// with the comparison flipped from > to <.
SIMD_INLINE __m256i __my_mm256_min_epu64_(__m256i a, __m256i b) {
  uint64_t *val_a = (uint64_t*) &a;
  uint64_t *val_b = (uint64_t*) &b;
  uint64_t e[4];
  for (size_t i = 0; i < 4; ++i) e[i] = (val_a[i] < val_b[i]) ? val_a[i] : val_b[i];
  return _mm256_set_epi64x(e[3], e[2], e[1], e[0]);
}

Solution

  • The simplest solution is a combination of _mm256_cmpgt_epi64 with a blend. However, if you want the unsigned maximum, you first need to subtract 1<<63 from each element (before the comparison, not before blending). There is no _mm256_blendv_epu64 instruction, but you can use _mm256_blendv_epi8, since the comparison mask sets or clears every bit within each 64-bit element. Also note that subtracting the uppermost bit can be done with a slightly faster xor:

    __m256i pmax_epu64(__m256i a, __m256i b)
    {
        // Flipping the sign bit of both operands makes the signed compare
        // produce the unsigned ordering; then blend a/b on that mask.
        __m256i signbit = _mm256_set1_epi64x(0x8000'0000'0000'0000);
        __m256i mask = _mm256_cmpgt_epi64(_mm256_xor_si256(a,signbit),_mm256_xor_si256(b,signbit));
        return _mm256_blendv_epi8(b,a,mask);
    }
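
    For example (not part of the original answer), here is a quick check with values where the signed and unsigned orderings disagree, assuming pmax_epu64 from above and an AVX2-enabled build:

    #include <immintrin.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
        // -1 is 0xFFFF...FFFF when viewed as unsigned, i.e. the largest value;
        // a plain signed max would wrongly pick 1 here.
        __m256i a = _mm256_set1_epi64x(-1);
        __m256i b = _mm256_set1_epi64x(1);
        uint64_t out[4];
        _mm256_storeu_si256((__m256i*)out, pmax_epu64(a, b));
        printf("%llx\n", (unsigned long long)out[0]);  // prints ffffffffffffffff
    }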
    

    Actually, clang almost manages to generate the same instructions from your code: https://godbolt.org/z/afhdOa. It only uses vblendvpd instead of vpblendvb, which may introduce extra latency (see @PeterCordes' comment for details).

    With some bit-twiddling you can actually avoid setting up the register that holds the sign bit. An unsigned comparison gives the same result as a signed one if the signs of both operands match, and the opposite result if they don't, i.e.

    unsigned_greater_than(signed a, signed b) == (a<0) ^ (b<0) ^ (a>b)
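
    As a sanity check (not part of the original answer), here is a scalar sketch that verifies the identity against a plain unsigned comparison for a few edge-case values:

    #include <cstdint>
    #include <cassert>

    int main() {
        const uint64_t vals[] = {0, 1, 0x7FFFFFFFFFFFFFFFull,
                                 0x8000000000000000ull, 0xFFFFFFFFFFFFFFFFull};
        for (uint64_t ua : vals)
            for (uint64_t ub : vals) {
                // Reinterpret as signed (assumes the usual two's complement wraparound).
                int64_t a = (int64_t)ua, b = (int64_t)ub;
                bool identity = (a < 0) ^ (b < 0) ^ (a > b);
                assert(identity == (ua > ub));  // matches the unsigned comparison
            }
    }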
    

    This identity can be exploited by using _mm256_blendv_pd, with some casting, as a _mm256_blendv_epi64 (this is necessary because now only the uppermost bit of each mask element is valid):

    // Blend 64-bit elements of a and b based on the top bit of each mask element,
    // by reinterpreting everything as packed doubles so vblendvpd can be used.
    __m256i _mm256_blendv_epi64(__m256i a, __m256i b, __m256i mask)
    {
        return _mm256_castpd_si256(_mm256_blendv_pd(
            _mm256_castsi256_pd(a),_mm256_castsi256_pd(b),_mm256_castsi256_pd(mask)));
    }
    
    __m256i pmax_epu64_b(__m256i a, __m256i b)
    {
        // opposite_sign has its top bit set exactly where the signs of a and b differ;
        // xor-ing it into the signed comparison result yields the unsigned ordering.
        __m256i opposite_sign = _mm256_xor_si256(a,b);
        __m256i mask = _mm256_cmpgt_epi64(a,b);
        return _mm256_blendv_epi64(b,a,_mm256_xor_si256(mask, opposite_sign));
    }
    

    Just for reference, a signed maximum is of course just:

    __m256i pmax_epi64(__m256i a, __m256i b)
    {
        __m256i mask = _mm256_cmpgt_epi64(a,b);
        return _mm256_blendv_epi8(b,a,mask);
    }
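
    Since the question also asks for a _mm256_min_epu64_, here is a sketch of the unsigned minimum (not part of the original answer): it uses the same sign-bit flip and signed compare, and simply blends the other way around.

    // Unsigned 64-bit minimum, mirroring pmax_epu64 above.
    __m256i pmin_epu64(__m256i a, __m256i b)
    {
        __m256i signbit = _mm256_set1_epi64x(0x8000'0000'0000'0000);
        __m256i mask = _mm256_cmpgt_epi64(_mm256_xor_si256(a,signbit),_mm256_xor_si256(b,signbit));
        return _mm256_blendv_epi8(a,b,mask);  // where a > b (unsigned), take b, i.e. the smaller value
    }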