SSE Examples and Free Source Code

Why is the generated assembly reordered when using intrinsics?...

c gcc x86 sse intrinsics

Auto-vectorizing: Convincing the compiler that alias check is not necessary...

c++ opencv gcc vectorization sse

Is there a difference between SVML vs. normal intrinsic square root functions?...

c++ intel sse intrinsics sse2

Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? O...

gcc assembly x86 sse avx

In GNU C inline asm, what are the size-override modifiers for xmm/ymm/zmm for a single operand?...

c gcc sse inline-assembly avx512

Why does GCC or Clang not optimise reciprocal to 1 instruction when using fast-math...

c++ sse compiler-optimization simd fast-math

Why do SSE instructions preserve the upper 128-bit of the YMM registers?...

performance x86 simd sse avx

How many clock cycles does cost AVX/SSE exponentiation on modern x86_64 CPU?...

c++ x86 x86-64 sse avx

How to best emulate the logical meaning of _mm_slli_si128 (128-bit bit-shift), not _mm_bslli_si128...

c sse simd intrinsics sse2

Logarithm with SSE, or switch to FPU?...

sse simd logarithm natural-logarithm

parallel prefix (cumulative) sum with SSE...

c sum openmp sse

How to compute sine values somewhere, and then move them into XMM0 in assembly?...

assembly x86 sse x87 fpu

Why won't simple code get auto-vectorized with SSE and AVX in modern compilers?...

c optimization sse avx auto-vectorization

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX...

c sse cpu-architecture avx fma

SSE4.1 slower than SSE3 on 4x4 matrix multiplication?...

c++ matrix simd sse matmul

Does SSE/AVX provide a means of determining if a result was rounded up?...

x86 rounding sse simd avx

Write access violation on read instruction (MOVQ load on old Athlon XP)...

visual-c++ x86 sse amd-processor sse2

What series of intrinsics will complete this paeth prediction code?...

c++ sse intrinsics

Calculating constants for CRC32 using PCLMULQDQ...

sse crc32 modular-arithmetic galois-field

Classification of x86 instructions according to floating point rounding mode sensitivity?...

assembly floating-point x86-64 sse rounding-error

Why do x86 FP compares set CF like unsigned integers, instead of using signed conditions?...

assembly x86 sse sse2 x87

Intel x86_64 assembly compare signed double precision floats...

assembly x86-64 intel precision sse

How to efficiently perform double/int64 conversions with SSE/AVX?...

c++ floating-point sse simd avx

Is there a way to utilize all XMM registers?...

c++ c sse cpu-registers

Output errors when using libmvec intrinsics for trigo functions manually (like cosf)...

c++ gcc glibc sse intrinsics

How to optimize cell-width measuring with SIMD (find the first column to have a non-zero byte in an ...

c x86-64 simd sse avx

Is worth using SSE or should I just rely on the compiler?...

c++ optimization intel simd sse

Accelerate CRC32b using intel processors...

x86 intel sse crc32

Why does .NET use SIMD and not x87 for math operations not intrinsic to SIMD?...

.net assembly simd sse x87

Why is SSE4.2 cmpstr slower than regular code?...

c performance assembly x86 sse