Search code examples
Why do bit manipulation intrinsics like _bextr_u64 often perform worse than simple shift and mask op...


gcc, bit-manipulation, x86-64, intrinsics, avx

Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? O...


gcc, assembly, x86, sse, avx

Why do SSE instructions preserve the upper 128-bit of the YMM registers?...


performance, x86, simd, sse, avx

How many clock cycles does cost AVX/SSE exponentiation on modern x86_64 CPU?...


c++, x86, x86-64, sse, avx

Why won't simple code get auto-vectorized with SSE and AVX in modern compilers?...


c, optimization, sse, avx, auto-vectorization

How does SIMD (avx) processing work? for example, if I want 10 32 bit floats how do i fit in a 256 b...


c, simd, avx

why is my simd vector plus and set slower than using std::transform and std::plus<T> - am i do...


c++, vector, vectorization, simd, avx

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX...


c, sse, cpu-architecture, avx, fma

Does SSE/AVX provide a means of determining if a result was rounded up?...


x86, rounding, sse, simd, avx

Best way to mask a single bit in AVX2?...


c, x86, simd, avx, avx2

How to efficiently perform double/int64 conversions with SSE/AVX?...


c++, floating-point, sse, simd, avx

What is the inverse of "_mm256_cvtepi16_epi32"...


x86, g++, intrinsics, avx, avx2

AVX2: Get every second int32...


c, simd, avx, avx2, int32

How to optimize cell-width measuring with SIMD (find the first column to have a non-zero byte in an ...


c, x86-64, simd, sse, avx

I need more performance for int8 vector multiplication (Intel AVX-512)...


performance, simd, avx, avx2, avx512

Efficient way for using int8 AVX512-VNNI instruction, especially about loading the data to zmm regis...


performance, intel, matrix-multiplication, avx, avx512

AVX 32-bit integer to double precision float best practice...


avx, avx2

Have I written these sha256 #define's the correct way?...


c, algorithm, sha256, avx, sha2

What is the difference between shuffle and permute...


x86, intel, simd, naming, avx

Load and duplicate 4 single precision float numbers into a packed __m256 variable with fewest instru...


c++, avx

Differences between AVX and AVX2...


x86, matrix-multiplication, simd, avx, avx2

Is this a gcc bug? Function returns 0 when looping an int* over elements of a __m256i...


c, gcc, x86, intrinsics, avx

SIMD: Accumulate Adjacent Pairs...


c++, sse, simd, intrinsics, avx

AVX512 assembly breaks when called concurrently from different goroutines...


go, assembly, avx, avx512

What are the best instruction sequences to generate vector constants on the fly?...


assembly, x86, sse, simd, avx

AVX2 integer shuffle with types other than byte?...


c#, avx, avx2

Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?...


assembly, gcc, compiler-optimization, simd, avx

How to understand this AVX addition of two _m256i variables?...


c++, vector, avx, avx2, avx512

Emulate AVX512 VPCOMPRESSB byte packing without AVX512_VBMI2...


x86-64, simd, avx, avx512

Shifting SSE/AVX registers 32 bits left and right while shifting in zeros...


x86, sse, simd, avx, avx2
