Find the INDEX of element having max. absolute value using AVX512 instructions

I'm a newbie for coding using AVX512 instructions. My machine is Intel KNL 7250. I am trying to use AVX512 instructions to find the INDEX of the element having maximum absolute value, which double precision and size of array % 8 = 0. But it prints an output index = 0 every time. I don't know where is a problem, please help me. Also, how to use printf for __m512i type?

Thanks.

Code:

void main()
{
    int i;
    int N=160;
    double vec[N];

    for(i=0;i<N;i++)
    {
        vec[i]=(double)(-i) ;
        if(i==10)
        {
            vec[i] = -1127;
        }
    }

    int max = avxmax_(N, vec);
    printf("maxindex=%d\n", max);
}

int avxmax_(int   N,     double *X )
{
    // return the index of element having maximum absolute value.
    int maxindex, ix, i, M;
    register __m512i increment, indices, maxindices, maxvalues, absmax, max_values_tmp, abs_max_tmp, tmp;
    register __mmask8 gt;
    double values_X[8];
    double indices_X[8];
    double maxvalue;
    maxindex = 1;
    if( N == 1) return(maxindex);


    M = N % 8;
    if( M == 0)
    {
        increment =  _mm512_set1_epi64(8); // [8,8,8,8,8,8,8,8]
        indices = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 6, 7);
        maxindices = indices;
        maxvalues =  _mm512_loadu_si512(&X[0]);
        absmax = _mm512_abs_epi64(maxvalues); 

        for( i = 8; i < N; i += 8)
        {
            // advance scalar indices: indices[0] + 8, indices[1] + 8,...,indices[7] + 8
            indices = _mm512_add_epi64(indices, increment);

            // compare
            max_values_tmp  = _mm512_loadu_si512(&X[i]);
            abs_max_tmp = _mm512_abs_epi64(max_values_tmp);
            gt           = _mm512_cmpgt_epi64_mask(abs_max_tmp, absmax);

            // update
            maxindices = _mm512_mask_blend_epi64(gt, maxindices, indices);
            absmax     = _mm512_max_epi64(absmax, abs_max_tmp);
        }

        // scalar part
        _mm512_storeu_si512((__m512i*)values_X, absmax);
        _mm512_storeu_si512((__m512i*)indices_X, maxindices);
        maxindex = indices_X[0];
        maxvalue = values_X[0];
        for(i = 1; i < 8; i++)
        {
            if(values_X[i] > maxvalue)
            {
                maxvalue = values_X[i];
                maxindex = indices_X[i];
            }
            
        }
        return(maxindex);
    }
}

Solution

Your function returns 0 because you're treating the int64 index as the bit-pattern for a double, and converting that (tiny) number to an integer. double indices_X[8]; is the bug; should be uint64_t. There are other bugs, see below.

This bug is easier to spot if you declare variables as you use them, C99 style, not obsolete C89 style.

You _mm512_storeu_si512 the vector of int64_t indices into double indices_X[8], type-punning it to double, then in plain C do int maxindex = indices_X[0];. This is implicit type-conversion, converting that subnormal double to an integer.

(I noticed a mysterious vcvttsd2si FP->int conversion in the asm https://godbolt.org/z/zsfc36 while converting the code to C99 style variable declarations next to initializers. That was a clue: there should be no FP->int conversion in this function. I noticed that around the same time I was moving the double indices_X[8]; declaration down into the block that uses it, and noticing it had type double.)

It is actually possible to use integer operations on FP bit-patterns

But only if you use the right ones! IEEE754 exponent biases mean that the encoding / bit-pattern can be compared as a sign/magnitude integer. So you can do abs / min / max and compare on it, but not of course integer add / sub (unless you're implementing nextafter).

_mm512_abs_epi64 is 2's complement abs, not sign-magnitude. Instead, you must just mask off the sign bit. Then you're all set to treat the result as an unsigned integer or signed-2's-complement. (Either works because the high bit is clear.)

Using integer max has the interesting property that NaNs will compare higher than any other value, Inf below that, then finite values. So we get a NaN-propagating max-finder basically for free.

On KNL (Knight's Landing), FP vmaxpd and vcmppd have the same performance as their integer equivalents: 2 cycle latency, 0.5c throughput. (https://agner.org/optimize/). So your way has zero advantage on KNL, but it's a neat trick for mainstream Intel, like Skylake-X and IceLake.

Bugfixed optimized version:

Use size_t for return type and loop counters / indices to handle potentially huge arrays, instead of a random mix of int and 64-bit vector elements. (uint64_t for the temp array that collects the horizontal-max: it's always 64-bit even in a build with 32-bit pointers / size_t.)
bugfix: return 0 on N==1, not 1: the index of the only element is 0.
bugfix: return -1 on N%8 != 0, instead of falling off the end of the non-void function. (Undefined behaviour if the caller uses the result in C, or as soon as execution falls off the end in C++).
bugfix: abs of an FP value = clear the sign bit, not 2's complement abs on the bit-pattern
sort of bugfix: use unsigned integer compare and max, so it would work for 2's complement integers with _mm512_abs_epi64 (which produces an unsigned result; remember that -LONG_MIN overflows to LONG_MIN if you keep treating it as signed).
style improvement: if (N%8 != 0) return -1; instead of putting most of the body in an if block.
style improvement: declare vars when they're first used, and removed some unused ones that were pure noise. This is idiomatic for C since C99, which was standardized over 20 years ago.
style improvement: use simpler names for tmp vector vars that just hold a load result. Sometimes you just need a tmp var because intrinsic names are so long that you don't want to type _mm...load... as an arg for another intrinsics. A name like v scoped to the inner loop is a clear sign it's just a placeholder, not used later. (This style works best when you're declaring it as you init it, so it's easy to see it can't be used in an outer scope.)
optimization: reduce 8 -> 4 elements after the loop with SIMD: extract the high half, combine with existing low half. (Same as you would for a simpler horizontal reduction like sum or max). Inconvenient when we need instructions that only AVX512 has, but KNL doesn't have AVX512VL, so we have to use the 512-bit version and ignore the high garbage. But KNL does have AVX1 / AVX2 so we can still store 256-bit vectors and do some things.

Using a merge-masking _mm512_mask_extracti64x4_epi64 extract to blend the high half directly the low half of the same vector is a cool trick which compilers don't find if you use a 512-bit mask-blend. :P
sort of bugfix: in C, main has a return type of int in hosted implementations (running under an OS).

#include <immintrin.h>
#include <stdio.h> 
#include <stdint.h>
#include <stdlib.h>

// bugfix: indices can be larger than an int
size_t avxmax_(size_t N,  double *X )
{
    // return the index of element having maximum absolute value.
    if( N == 1)
        return 0;    // bugfix: 0 is the only valid element in this case, not 1
    if( N % 8 != 0)      // [[unlikely]] // C++20
        return -1;   // bugfix: don't fall off the end of the function in this case

    const __m512i fp_absmask = _mm512_set1_epi64(0x7FFFFFFFFFFFFFFF);
    __m512i indices = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 6, 7);
    __m512i maxindices = indices;
    __m512i v =  _mm512_loadu_si512(&X[0]);
    __m512i absmax = _mm512_and_si512(v, fp_absmask);
    for(size_t i = 8; i < N; i += 8) // [[likely]]  // C++20
    {
        // advance indices by 8 each.
        indices = _mm512_add_epi64(indices, _mm512_set1_epi64(8));
        // compare
                v    = _mm512_loadu_si512(&X[i]);
        __m512i vabs = _mm512_and_si512(v, fp_absmask);
             // vabs = _mm512_abs_epi64(max_values_tmp);  // for actual integers, not FP bit patterns
        __mmask8 gt  = _mm512_cmpgt_epu64_mask(vabs, absmax);
        // update
        maxindices = _mm512_mask_blend_epi64(gt, maxindices, indices);
        absmax     = _mm512_max_epu64(absmax, vabs);
    }

    // reduce 512->256; KNL doesn't have AVX512VL so some ops require 512-bit vectors
    __m256i absmax_hi = _mm512_extracti64x4_epi64(absmax, 1);
    __m512i absmax_hi512 = _mm512_castsi256_si512(absmax_hi); // free
    __mmask8 gt = _mm512_cmpgt_epu64_mask(absmax_hi512, absmax);
    __m256i abs256 = _mm512_castsi512_si256(_mm512_max_epu64(absmax_hi512, absmax));  // reduced to low 4 elements

    // extract with merge-masking = blend
    __m256i maxindices256 = _mm512_mask_extracti64x4_epi64(
                           _mm512_castsi512_si256(maxindices), gt, maxindices, 1);

    // scalar part
    double values_X[4];
    uint64_t indices_X[4];
    _mm256_storeu_si256((__m256i*)values_X, abs256);
    _mm256_storeu_si256((__m256i*)indices_X, maxindices256);

    size_t maxindex = indices_X[0];
    double  maxvalue = values_X[0];
    for(int i = 1; i < 4; i++)
    {
        if(values_X[i] > maxvalue)
        {
            maxvalue = values_X[i];
            maxindex = indices_X[i];
        }
        
    }
    return maxindex;
}

On Godbolt: the main loop from GCC10.2 -O3 -march=knl is 8 instructions. So even if (best case) KNL could decode and run it at 2/clock, it's still taking 4 cycles per vector. You can run the program on Godbolt; it runs on Skylake-X servers so it can run AVX512 code. You can see it prints 10.

.L4:
        vpandd  zmm2, zmm5, ZMMWORD PTR [rsi+rax*8]   # load, folded into AND
        add     rax, 8
        vpcmpuq k1, zmm2, zmm0, 6
        vpaddq  zmm1, zmm1, zmm4                    # increment current indices
        cmp     rdi, rax
        vmovdqa64       zmm3{k1}, zmm1              # blend maxidx using merge-masking
        vpmaxuq zmm0, zmm0, zmm2
        ja      .L4

        vmovapd zmm1, zmm3                        # silly missed optimization related to the case where the loop runs 0 times.
.L3:
        vextracti64x4   ymm2, zmm0, 0x1           # high half of absmax
        vpcmpuq k1, zmm2, zmm0, 6                 # compare high and low
        vpmaxuq zmm0, zmm0, zmm2
      #  vunpckhpd       xmm2, xmm0, xmm0  # setting up for unrolled scalar loop
        vextracti64x4   ymm1{k1}, zmm3, 0x1       # masked extract of indices

Another option for the loop would be a masked vpbroadcastq zmm3{k1}, rax, adding the [0..7] per-element offsets after the loop. That would actually save the vpaddq in the loop, and we have the right i in a register if GCC is going to use an indexed addressing-mode anyway. (That's not good on Skylake-X; defeats micro-fusion of the memory-source vpandd.)

Agner Fog doesn't list performance for GP->vector broadcasts, but hopefully it's only single-uop on KNL at least. (And https://uops.info/ doesn't have KNL or KNM results).

Branchy strategy: when a new max is very rare

If you expect finding a new max to be very rare (e.g. array is large and uniformly distributed, or at least not trending upwards), it could be even faster to broadcast the current max and branch on finding any greater vector element.

Finding a new max means branching out of the loop (which probably mispredicts, so that's slow) and broadcasting that element (probably with a tzcnt to find the element index, then a broadcast-load, and update the index).

Especially with KNL's 4-way SMT to hide branch miss costs, this could be an overall throughput win for large arrays; fewer instructions per element on average.

But probably significantly worse for inputs that do trend upwards, so a new max is found O(n) times on average, not sqrt(n) or log(n) or whatever frequency a uniform distribution would give us.

PS: to print vectors, store to an array and reload the elements. print a __m128i variable

Or use a debugger to show you their elements.