
Differences between AVX and AVX2


Below is an implementation of a matrix multiply using AVX2. The machine I am using only supports AVX, so I am trying to implement the same routine with AVX.

However, I am having trouble deciphering what the differences really are, and what would need to be changed. What in this implementation is specific to AVX2 that would not work on a machine only able to process AVX?

This is a link to all the intrinsics for AVX as well as AVX2: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX

Thank you for any insight at all!

    for (uint64_t i = 0; i < M; i++)
    {
        for (uint64_t j = 0; j < N; j++)
        {
            __m256 X = _mm256_setzero_ps();
            for (uint64_t k = 0; k < L; k += 8) {
                const __m256 AV = _mm256_load_ps(A+i*L+k);
                const __m256 BV = _mm256_load_ps(B+j*L+k);
                X = _mm256_fmadd_ps(AV,BV,X);
            }
            C[i*N+j] = hsum_avx(X);
        }
    }


Solution

  • Your code uses AVX1 + FMA instructions, not AVX2. It would run OK on an AMD Piledriver, for example (assuming the hsum is implemented in a sane way, extracting the high half and then using 128-bit shuffles).

    If your AVX-only CPU doesn't have FMA either, you'd need to use _mm256_mul_ps and _mm256_add_ps.


    For Intel, AVX2 and FMA were introduced in the same generation, Haswell, but those are different extensions. FMA is available in some CPUs without AVX2.

    There is unfortunately even a VIA CPU with AVX2 but not FMA; otherwise AVX2 implies FMA, unless you're in a VM or emulator that intentionally exposes a combination of extensions that real HW doesn't.

    MSVC /arch:AVX2 and GCC / clang -march=x86-64-v3 both imply a Haswell feature level, AVX2+FMA+BMI1/2.

    (There was an FMA4 extension in some AMD CPUs, Bulldozer through Zen1, with 4 operands (3 inputs and a separate output), after Intel pulled a switcheroo on AMD too late for them to change their Bulldozer design to support FMA3. That's why there's an AMD-only FMA4, and why it wasn't until Piledriver that AMD supported an FMA extension compatible with Intel's. But that's part of the dust pile of history now, so usually we just say FMA to refer to the extension that's technically called FMA3. See Agner Fog's 2009 blog Stop the instruction set war, and How do I know if I can compile with FMA instruction sets?)


    • AVX1: 256-bit FP only (no integer instructions except vptest, although FP in this case does include bitwise instructions like vxorps ymm). Shuffles are only in-lane (e.g. vshufps ymm or the new vpermilps) or with 128-bit granularity (vperm2f128 or vinsertf128 / vextractf128). AVX1 also provides VEX encodings of all SSE1..4 instructions, including the integer ones, in 3-operand non-destructive form, e.g. vpsubb xmm0, xmm1, [rdi].
    • AVX2: 256-bit versions of the integer SSE instructions, and new lane-crossing shuffles like vpermps / vpermd and vpermq / pd, plus vbroadcastss/sd ymm, xmm with a register source (AVX1 only had vbroadcastss ymm, [mem]). Also an efficient vpblendd immediate integer blend instruction, like vblendps.
    • FMA3: vfmadd213ps x/ymm, x/ymm, x/ymm/mem and so on (plus pd and scalar ss/sd versions). Also fmsub.. (subtract the 3rd operand), fnmadd.. (negate the product), and even fmaddsub...ps. _mm256_fmadd_ps will compile to some form of vfmadd...ps, depending on which input operand the compiler wants to overwrite and which operand it wants to use as the memory operand.

    This order of introduction explains the bad choice of intrinsic naming, e.g. _mm256_permute_ps (immediate) and _mm256_permutevar_ps (vector control) are AVX1 vpermilps in-lane permutes, while AVX2's lane-crossing vpermps got saddled with _mm256_permutevar8x32_ps (and AVX-512VL later added _mm256_permutexvar_ps). So confusingly the lane-crossing intrinsics are the ones stuck with the clunky names, while the asm mnemonic is just plain vpermps.