Search code examples
Efficient way for using int8 AVX512-VNNI instruction, especially about loading the data to zmm regis...


performance intel matrix-multiplication avx avx512

AVX 32-bit integer to double precision float best practice...


avx avx2

I need more performance for int8 vector multiplication (Intel AVX-512)...


performance simd avx avx2 avx512

Have I written these sha256 #define's the correct way?...


c algorithm sha256 avx sha2

What is the difference between shuffle and permute...


x86 intel simd naming avx

Load and duplicate 4 single precision float numbers into a packed __m256 variable with fewest instru...


c++ avx

Differences between AVX and AVX2...


x86 matrix-multiplication simd avx avx2

Is this a gcc bug? Function returns 0 when looping an int* over elements of a __m256i...


c gcc x86 intrinsics avx

SIMD: Accumulate Adjacent Pairs...


c++ sse simd intrinsics avx

AVX512 assembly breaks when called concurrently from different goroutines...


go assembly avx avx512

What are the best instruction sequences to generate vector constants on the fly?...


assembly x86 sse simd avx

AVX2 integer shuffle with types other than byte?...


c# avx avx2

Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?...


assembly gcc compiler-optimization simd avx

How to understand this AVX addition of two _m256i variables?...


c++ vector avx avx2 avx512

Emulate AVX512 VPCOMPRESSB byte packing without AVX512_VBMI2...


x86-64 simd avx avx512

Shifting SSE/AVX registers 32 bits left and right while shifting in zeros...


x86 sse simd avx avx2

Why gcc is so much worse at std::vector<float> vectorization of a conditional multiply than cl...


c++ gcc vectorization compiler-optimization avx

Why won't simple code get auto-vectorized with SSE and AVX in modern compilers?...


c optimization sse avx auto-vectorization

How to run bitwise OR on big vectors of u64 in the most performant manner?...


c++ performance assembly cpu avx

Using SIMD To Parallelize Matrix Multiplication For A 4x4, Row-Major Matrix...


c matrix-multiplication intrinsics avx

AVX Intrinsic Clarification, 4x4 Matrix Multiplication Oddities...


c++ c matrix-multiplication avx

Converting between Pair-wise and Component-wise in AVX...


c simd avx double-double-arithmetic

Squared Quaternion using AVX...


optimization vectorization quaternions avx

AVX2 code to find the first longest match of 4-byte string among 8 4-byte targets...


bit-manipulation simd avx avx2 lz77

How to perform parallel addition using AVX with carry (overflow) fed back into the same element (PE ...


c simd avx avx2 avx512

Is there an ARM Neon Gather Instruction?...


c++ arm simd avx neon

Why does '_mm256_fmadd_ps' cause precision loss?...


c precision avx avx2 fma

Unknown type name __m256 - Intel intrinsics for AVX not recognized?...


c++ c intel intrinsics avx

AVX MaskLoad/MaskStore performance...


c# simd avx

gcc: Optimize single function with `-mavx -mprefer-avx128`...


c gcc compiler-optimization avx
