c++ simd intrinsics avx avx2

How do the AVX(2) gather instructions actually compute the fetch address?


The current Intel intrinsics guide for _mm_i32gather_epi32() describes the computed address for each subword as:

addr := base_addr + SignExtend64(vindex[m+31:m]) * ZeroExtend64(scale) * 8

That last 8 puzzles me. Assuming addr and base_addr are in bytes and scale takes a value of 1, 2, 4 or 8, every offset would be a multiple of 8 bytes, so you could only ever reach elements at 8-byte strides from the base address. Is this an error in the docs, or am I missing something? It's described the same way for all the gather instructions I checked.

A previous question quotes the docs without that 8, which suggests something has changed.


Solution

  • Note the next line in the pseudo-code:

    dst[i+31:i] := MEM[addr+31:addr]
    

    Apparently someone decided it would be a good idea to describe the memory address as a bit address, not a byte address. /facepalm. That doesn't really make sense, is not what anyone would expect, and isn't even done consistently: they failed to scale base_addr by 8, so they're adding a bit offset to a byte address.

    This is just poor documentation, and a worse way to describe the operation than the previous version quoted in the linked question. It's only a documentation change, not a change to what the code means; you can confirm that by compiling it and looking at the asm to see the actual instruction generated. (My answer on the question you linked is still correct: the asm instruction allows a scale factor of 1, 2, 4, or 8, encoded as a 2-bit shift count the same way as in the scaled-index addressing modes of scalar instructions. So with scale = 1 you can use a vector of byte offsets.)

    The previous better pseudo-code was:

    dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+31:i])*scale]
    

    So MEM[] (the virtual address space) is being indexed with the calculated byte address, and the 32-bit access width is implied by the width of the destination slice dst[i+31:i].
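
The older pseudo-code can be modelled in scalar C++ to make the byte arithmetic explicit. This is only a sketch under stated assumptions, not Intel's definition: the name gather_epi32_model and its signature are made up for illustration, and it covers just the unmasked 4-lane, 32-bit case.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of an unmasked 4-lane, 32-bit gather.
// All address arithmetic is in bytes; only the access width (32 bits) is fixed.
inline void gather_epi32_model(const void* base_addr, const std::int32_t vindex[4],
                               int scale, std::int32_t dst[4]) {
    // The instruction encoding only allows these four scale factors.
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const auto* base = static_cast<const unsigned char*>(base_addr);
    for (int lane = 0; lane < 4; ++lane) {
        // Sign-extend the 32-bit index to 64 bits, then scale it to a byte offset.
        const std::int64_t offset = static_cast<std::int64_t>(vindex[lane]) * scale;
        // memcpy models a 32-bit load from a byte address that may be unaligned.
        std::memcpy(&dst[lane], base + offset, sizeof(std::int32_t));
    }
}
```

With scale = 4 the indices count 32-bit elements; with scale = 1 the same elements are reached by passing byte offsets (index * 4). There is no extra factor of 8 anywhere.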


    As a rule of thumb, intrinsics generally map as directly as possible to the asm instructions. They wouldn't choose to define it in a way that requires the compiler to emit a vpslld ymm0, ymm1, 3 to scale the index register before running vpgatherdd.

    So you can consult the asm instruction's documentation (which sometimes has different pseudo-code, like in this case): https://www.felixcloutier.com/x86/vpgatherdd:vpgatherqd

    ...
        DATA_ADDR←BASE_ADDR + SignExtend(VINDEX1[i+31:i])*SCALE + DISP;
        IF MASK[31+i] THEN
            DEST[i +31:i]←FETCH_32BITS(DATA_ADDR); // a fault exits the instruction
        FI;
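
To check the byte-offset interpretation against the real intrinsic, something like the following can be used. It is a sketch, not a definitive implementation: the helper name gather4_at_byte_offsets is hypothetical, and the intrinsic path is only compiled when the compiler defines __AVX2__ (for example with -mavx2 on GCC or Clang); otherwise a scalar fallback with the same semantics is used.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: load four 32-bit values found at the given BYTE
// offsets from `base`.
inline void gather4_at_byte_offsets(const void* base,
                                    const std::int32_t byte_offsets[4],
                                    std::int32_t out[4]) {
#if defined(__AVX2__)
    const __m128i vindex =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(byte_offsets));
    // scale = 1: each index is used unmodified as a byte offset.
    // The scale argument must be a compile-time constant (1, 2, 4 or 8).
    const __m128i v =
        _mm_i32gather_epi32(static_cast<const int*>(base), vindex, 1);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
#else
    // Portable fallback: the same byte-address arithmetic, one lane at a time.
    const auto* p = static_cast<const unsigned char*>(base);
    for (int lane = 0; lane < 4; ++lane) {
        std::memcpy(&out[lane], p + byte_offsets[lane], sizeof(std::int32_t));
    }
#endif
}
```

For an array of 12-byte records, passing offsets 4, 16, 28 and 40 picks out the second 32-bit field of each record. If the documented "* 8" were real, those offsets would land far outside a 48-byte array. Compiling the AVX2 path and inspecting the asm should show a vpgatherdd with a scale of 1 and no shift of the index vector.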