avx Examples and Free Source Code

Why do bit manipulation intrinsics like _bextr_u64 often perform worse than simple shift and mask op...

gcc bit-manipulation x86-64 intrinsics avx

Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? O...

gcc assembly x86 sse avx

Why do SSE instructions preserve the upper 128-bit of the YMM registers?...

performance x86 simd sse avx

How many clock cycles does AVX/SSE exponentiation cost on a modern x86_64 CPU?...

c++ x86 x86-64 sse avx

Why won't simple code get auto-vectorized with SSE and AVX in modern compilers?...

c optimization sse avx auto-vectorization

How does SIMD (avx) processing work? for example, if I want 10 32 bit floats how do i fit in a 256 b...

c simd avx

why is my simd vector plus and set slower than using std::transform and std::plus<T> - am i do...

c++ vector vectorization simd avx

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX...

c sse cpu-architecture avx fma

Does SSE/AVX provide a means of determining if a result was rounded up?...

x86 rounding sse simd avx

Best way to mask a single bit in AVX2?...

c x86 simd avx avx2

How to efficiently perform double/int64 conversions with SSE/AVX?...

c++ floating-point sse simd avx

What is the inverse of "_mm256_cvtepi16_epi32"...

x86 g++ intrinsics avx avx2

AVX2: Get every second int32...

c simd avx avx2 int32

How to optimize cell-width measuring with SIMD (find the first column to have a non-zero byte in an ...

c x86-64 simd sse avx

I need more performance for int8 vector multiplication (Intel AVX-512)...

performance simd avx avx2 avx512

Efficient way for using int8 AVX512-VNNI instruction, especially about loading the data to zmm regis...

performance intel matrix-multiplication avx avx512

AVX 32-bit integer to double precision float best practice...

avx avx2

Have I written these sha256 #define's the correct way?...

c algorithm sha256 avx sha2

What is the difference between shuffle and permute...

x86 intel simd naming avx

Load and duplicate 4 single precision float numbers into a packed __m256 variable with fewest instru...

c++ avx

Differences between AVX and AVX2...

x86 matrix-multiplication simd avx avx2

Is this a gcc bug? Function returns 0 when looping an int* over elements of a __m256i...

c gcc x86 intrinsics avx

SIMD: Accumulate Adjacent Pairs...

c++ sse simd intrinsics avx

AVX512 assembly breaks when called concurrently from different goroutines...

go assembly avx avx512

What are the best instruction sequences to generate vector constants on the fly?...

assembly x86 sse simd avx

AVX2 integer shuffle with types other than byte?...

c# avx avx2

Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?...

assembly gcc compiler-optimization simd avx

How to understand this AVX addition of two _m256i variables?...

c++ vector avx avx2 avx512

Emulate AVX512 VPCOMPRESSB byte packing without AVX512_VBMI2...

x86-64 simd avx avx512

Shifting SSE/AVX registers 32 bits left and right while shifting in zeros...

x86 sse simd avx avx2