Search code examples
cvectorizationmultiplicationavxavx2

Fastest way to multiply an array of int64_t?


I want to vectorize the multiplication of two memory aligned arrays. I didn't find any way to multiply 64*64 bit in AVX/AVX2, so I just did loop-unroll and AVX2 loads/stores. Is there a faster way to do this?

Note: I don't want to save the high-half result of each multiplication.

void multiply_vex(long *Gi_vec, long q, long *Gj_vec){

    int i;
    __m256i data_j, data_i;
    __uint64_t *ptr_J = (__uint64_t*)&data_j;
    __uint64_t *ptr_I = (__uint64_t*)&data_i;


    for (i=0; i<BASE_VEX_STOP; i+=4) {
        data_i = _mm256_load_si256((__m256i*)&Gi_vec[i]);
        data_j = _mm256_load_si256((__m256i*)&Gj_vec[i]);

        ptr_I[0] -= ptr_J[0] * q;
        ptr_I[1] -= ptr_J[1] * q;
        ptr_I[2] -= ptr_J[2] * q;
        ptr_I[3] -= ptr_J[3] * q;

        _mm256_store_si256((__m256i*)&Gi_vec[i], data_i);
    }


    for (; i<BASE_DIMENSION; i++)
        Gi_vec[i] -= Gj_vec[i] * q;
}

UPDATE: I'm using the Haswell microarchitecture with both ICC/GCC compilers. So both AVX and AVX2 is fine. I substitute the -= by the C intrisic _mm256_sub_epi64 after the multiplication loop-unroll, where it get some speedup. Currently, it is ptr_J[0] *= q; ...

I use __uint64_t but is a error. The right data type is __int64_t.


Solution

  • You seem to be assuming long is 64bits in your code, but then using __uint64_t as well. In 32bit, the x32 ABI, and on Windows, long is a 32bit type. Your title mentions long long, but then your code ignores it. I was wondering for a while if your code was assuming that long was 32bit.

    You're completely shooting yourself in the foot by using AVX256 loads but then aliasing a pointer onto the __m256i to do scalar operations. gcc just gives up and gives you the terrible code you asked for: vector load and then a bunch of extract and insert instructions. Your way of writing it means that both vectors have to be unpacked to do the sub in scalar as well, instead of using vpsubq.

    Modern x86 CPUs have very fast L1 cache that can handle two operations per clock. (Haswell and later: two loads and one store per clock). Doing multiple scalar loads from the same cache line is better than a vector load and unpacking. (Imperfect uop scheduling reduces the throughput to about 84% of that, though: see below)


    gcc 5.3 -O3 -march=haswell (Godbolt compiler explorer) auto-vectorizes a simple scalar implementation pretty well. When AVX2 isn't available, gcc foolishly still auto-vectorizes with 128b vectors: On Haswell, this will actually be about 1/2 the speed of ideal scalar 64bit code. (See the perf analysis below, but substitute 2 elements per vector instead of 4).

    #include <stdint.h>    // why not use this like a normal person?
    #define BASE_VEX_STOP 1024
    #define BASE_DIMENSION 1028
    
    // restrict lets the compiler know the arrays don't overlap,
    // so it doesn't have to generate a scalar fallback case
    void multiply_simple(uint64_t *restrict Gi_vec, uint64_t q, const uint64_t *restrict Gj_vec){
        for (intptr_t i=0; i<BASE_DIMENSION; i++)   // gcc doesn't manage to optimize away the sign-extension from 32bit to pointer-size in the scalar epilogue to handle the last less-than-a-vector elements
            Gi_vec[i] -= Gj_vec[i] * q;
    }
    

    inner loop:

    .L4:
        vmovdqu ymm1, YMMWORD PTR [r9+rax]        # MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
        add     rcx, 1    # ivtmp.30,
        vpsrlq  ymm0, ymm1, 32      # tmp174, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B],
        vpmuludq        ymm2, ymm1, ymm3        # tmp173, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B], vect_cst_.25
        vpmuludq        ymm0, ymm0, ymm3        # tmp176, tmp174, vect_cst_.25
        vpmuludq        ymm1, ymm4, ymm1        # tmp177, tmp185, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
        vpaddq  ymm0, ymm0, ymm1    # tmp176, tmp176, tmp177
        vmovdqa ymm1, YMMWORD PTR [r8+rax]        # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B]
        vpsllq  ymm0, ymm0, 32      # tmp176, tmp176,
        vpaddq  ymm0, ymm2, ymm0    # vect__13.24, tmp173, tmp176
        vpsubq  ymm0, ymm1, ymm0    # vect__14.26, MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__13.24
        vmovdqa YMMWORD PTR [r8+rax], ymm0        # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__14.26
        add     rax, 32   # ivtmp.32,
        cmp     rcx, r10  # ivtmp.30, bnd.14
        jb      .L4 #,
    

    Translate that back to intrinsics if you want, but it's going to be a lot easier to just let the compiler autovectorize. I didn't try to analyse it to see if it's optimal.

    If you don't usually compile with -O3, you could use #pragma omp simd before the loop (and -fopenmp).

    Of course, instead of a scalar epilogue, it would prob. be faster to do an unaligned load of the last 32B of Gj_vec, and store into the last 32B of Gi_vec, potentially overlapping with the last store from the loop. (A scalar fallback is still needed if the arrays are smaller than 32B.)

    GCC13 and Clang17 still both vectorize this way, using another two vpmuludq instead of 1x vpmulld to get both 32-bit cross products done with slightly less shuffling. But they are able to optimize away the insert/extract in the OP's pointer-aliasing ptr_I[0] -= ptr_J[0] * q;, so that compiles to the same asm loop as the simple scalar version.


    Improved vector intrinsic version for AVX2

    From my comments on Z Boson's answer. Based on Agner Fog's vector class library code, with various improvements in 2023 along with a bugfix. (For 8 years, nobody noticed that I was adding the cross products to the bottom of the full result, not to the top half to match their place-values; thanks @Petr for actually testing this.)

    Agner Fog's version saves an instruction but bottlenecks on the shuffle port by using phadd + pshufd where I use psllq / pand / paddd. vpmulld to get both cross products packed together is 2 uops on Haswell and later Intel (same as two separate vpmuludq instructions but with less shuffling of inputs), but only 1 uops for either of two ports on Zen 3 and later AMD.

    Since one of your operands is constant, make sure to pass set1(q) as b, not a, so the "b_swap" shuffle can be hoisted.

    // replace hadd -> shuffle (4 uops) with shift/and/add (3 uops with less shuffle-port pressure)
    // The constant takes 2 insns to generate outside a loop.
    __m256i mul64_avx2 (__m256i a, __m256i b)
    {
        // There is no vpmullq until AVX-512. Split into 32-bit multiplies
        // Given a and b composed of high<<32 | low  32-bit halves
        // a*b = a_low*(u64)b_low  + (u64)(a_high*b_low + a_low*b_high)<<32;  // same for signed or unsigned a,b since we aren't widening to 128
        // the a_high * b_high product isn't needed for non-widening; its place value is entirely outside the low 64 bits.
    
        __m256i b_swap  = _mm256_shuffle_epi32(b, _MM_SHUFFLE(2,3, 0,1));   // swap H<->L
        __m256i crossprod  = _mm256_mullo_epi32(a, b_swap);                 // 32-bit L*H and H*L cross-products
    
        __m256i prodlh = _mm256_slli_epi64(crossprod, 32);          // bring the low half up to the top of each 64-bit chunk 
        __m256i prodhl = _mm256_and_si256(crossprod, _mm256_set1_epi64x(0xFFFFFFFF00000000)); // isolate the other, also into the high half were it needs to eventually be
        __m256i sumcross = _mm256_add_epi32(prodlh, prodhl);       // the sum of the cross products, with the low half of each u64 being 0.
    
        __m256i prodll  = _mm256_mul_epu32(a,b);                  // widening 32x32 => 64-bit  low x low products
        __m256i prod    = _mm256_add_epi32(prodll, sumcross);     // add the cross products into the high half of the result
        return  prod;
    }
    

    This has been tested and now works, including for values with non-zero high and low halves. See it on Godbolt.

    Note that this doesn't include the final subtract from the question, only the multiply.

    This version should perform a bit better on Haswell than gcc's autovectorized version. (Like maybe one vector per 4 cycles instead of one vector per 5 cycles, bottlenecked on port0 throughput. Replacing vpsllq (srli_epi64) with vpshufb with a vector constant, or vpslldq by 4 and doing the masking after adding, can get that port 0 bottleneck down to 3 cycles.) Either way is fine on later CPUs like Skylake and Rocket Lake, where https://uica.uops.info/ estimates a throughput bottleneck of 2.67 based on execution-port pressure (for just the multiply without a loop or subtract), with perfect scheduling. I went with the shift version so it didn't need an extra vector constant, and let the shift/and run in parallel for better critical-path latency in case this is part of a dependency chain, unlike in the question where we don't do much else with each element between load and store, so out-of-order exec doesn't have to work hard to overlap between iterations.

    We can avoid vector constants entirely by doing sumcross = ((crossprod >> 32) + crossprod) << 32. That still takes 3 instructions, but the AND becomes a shift, so it's another uop that competes with multiplies (on Intel P-cores). And it has worse critical-path latency. But it avoids the constant, and uses fewer temporary registers (for the same reason it has worse ILP and higher latency), so could be useful if multiplying as part of a larger loop body that would otherwise run out of registers.

    An AVX1 or SSE4.1 version (two elements per vector) would not be faster than 64-bit scalar imul r64, [mem]. Don't do it unless you already have your data in vectors, and want the result in a vector. (It might be better than extracting to scalar and back.) Or on a CPU that has slow imul r64, r64 like perhaps Silvermont-family. In that case _mm_add_epi32 is faster than _mm_add_epi64.

    add_epi64 isn't slower or larger code-size on any AVX2 CPUs, but the SSE version was on some earlier CPUs. I used add_epi32 when possible because it won't be slower, and could possibly save some power or perhaps have lower latency on some hypothetical future CPU. According to https://uops.info/, Intel since Haswell and AMD since Zen 1 run vpaddd identically to vpaddq, including Alder Lake E-cores.


    Perf analysis of GCC's autovectorized code (not the intrinsic version)

    Background: see Agner Fog's insn tables and microarch guide, and other links in the tag wiki.

    Until AVX512 (see below), this is probably only barely faster than scalar 64bit code: imul r64, m64 has a throughput of one per clock on Intel CPUs (but one per 4 clocks on AMD Bulldozer-family). load/imul/sub-with-memory-dest is 4 fused-domain uops on Intel CPUs (with an addressing mode that can micro-fuse, which gcc fails to use). The pipeline width is 4 fused-domain uops per clock, so even a large unroll can't get this to issue at one-per-clock. With enough unrolling, we'll bottleneck on load/store throughput. 2 loads and one store per clock is possible on Haswell, but stores-address uops stealing load ports will lower the throughput to about 81/96 = 84% of that, according to Intel's manual.

    So perhaps the best way for Haswell would load and multiply with scalar, (2 uops), then vmovq / pinsrq / vinserti128 so you can do the subtract with a vpsubq. That's 8 uops to load&multiply all 4 scalars, 7 shuffle uops to get the data into a __m256i (2 (movq) + 4 (pinsrq is 2 uops) + 1 vinserti128), and 3 more uops to do a vector load / vpsubq / vector store. So that's 18 fused-domain uops per 4 multiplies (4.5 cycles to issue), but 7 shuffle uops (7 cycles to execute). So nvm, this is no good compared to pure scalar.


    The autovectorized code is using 8 vector ALU instructions for each vector of four values. On Haswell, 5 of those uops (multiplies and shifts) can only run on port 0, so no matter how you unroll this algorithm it will achieve at best one vector per 5 cycles (i.e. one multiply per 5/4 cycles.)

    The shifts could be replaced with pshufb (port 5) to move the data and shift in zeros. (Other shuffles don't support zeroing instead of copying a byte from the input, and there aren't any known-zeros in the input that we could copy.)

    paddq / psubq can run on ports 1/5 on Haswell, or p015 on Skylake.

    Skylake runs pmuludq and immediate-count vector shifts on on p01, so it could in theory manage a throughput of one vector per max(5/2, 8/3, 11/4) = 11/4 = 2.75 cycles. So it bottlenecks on total fused-domain uop throughput (including the 2 vector loads and 1 vector store). So a bit of loop unrolling will help. Probably resource conflicts from imperfect scheduling will bottleneck it to slightly less than 4 fused-domain uops per clock. The loop overhead can hopefully run on port 6, which can only handle some scalar ops, including add and compare-and-branch, leaving ports 0/1/5 for vector ALU ops, since they're close to saturating (8/3 = 2.666 clocks). The load/store ports are nowhere near saturating, though.

    So, Skylake can theoretically manage one vector per 2.75 cycles (plus loop overhead), or one multiply per ~0.7 cycles with GCC auto-vectorization, vs. Haswell's best option (one per ~1.2 cycles in theory with scalar, or one per 1.25 cycles in theory with vectors). The scalar one per ~1.2 cycles would probably require a hand-tuned asm loop, though, because compilers don't know how to use a one-register addressing mode for stores, and a two-register addressing mode for loads (dst + (src-dst) and increment dst).

    Also, if your data isn't hot in L1 cache, getting the job done with fewer instructions lets the frontend get ahead of the execution units, and get started on the loads before the data is needed. Hardware prefetch doesn't cross page lines, so a vector loop will probably beat scalar in practice for large arrays, and maybe even for smaller arrays.


    AVX-512DQ introduces a 64bx64b->64b vector multiply

    GCC can auto-vectorize using it if you add -mavx512dq. (In practice use -march=x86-64-v4, or -march=native or -march=skylake-avx512 or whatever to enable other stuff and set turning options.)

    .L4:
        vmovdqu64       zmm0, ZMMWORD PTR [r8+rax]    # vect__11.23, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
        add     rcx, 1    # ivtmp.30,
        vpmullq zmm1, zmm0, zmm2  # vect__13.24, vect__11.23, vect_cst_.25
        vmovdqa64       zmm0, ZMMWORD PTR [r9+rax]    # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B]
        vpsubq  zmm0, zmm0, zmm1    # vect__14.26, MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__13.24
        vmovdqa64       ZMMWORD PTR [r9+rax], zmm0    # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__14.26
        add     rax, 64   # ivtmp.32,
        cmp     rcx, r10  # ivtmp.30, bnd.14
        jb      .L4 #,
    

    So AVX512DQ (expected to be part of Skylake multi-socket Xeon (Purley) in ~2017) will give a much larger than 2x speedup (from wider vectors) if these instructions are pipelined at one per clock.

    Update: Skylake-AVX512 (aka SKL-X or SKL-SP) runs VPMULLQ at one per 1.5 cycles for xmm, ymm, or zmm vectors. It's 3 uops with 15c latency. (With maybe an extra 1c of latency for the zmm version, if that's not a measurement glitch in the AIDA results.)

    vpmullq is much faster than anything you can build out of 32-bit chunks, so it's very much worth having an instruction for this even if current CPUs don't have 64-bit-element vector-multiply hardware. (Presumably they use the mantissa multipliers in the FMA units.)

    Zen 4 runs vpmullq as a single uop for either of 2 ports, so 2/clock throughput for 256-bit vectors, 1/clock for 512-bit vectors. Later Intel CPUs (like Alder Lake / Sapphire Rapids) still run it as 3 uops for ports 0/1, so 1.5c throughput. https://uops.info/