SIMD Examples and Free Source Code

Implementation of __builtin_clz...

c gcc cpu simd

Fast bithacked log2 approximation...

math floating-point bit-manipulation simd

SIMD Intrinsics difference between Vector<T>, advsimd and sse?...

c# .net simd intrinsics

using SIMD on ARM cortex M4...

c arm clang simd cortex-m

Why does GCC or Clang not optimise reciprocal to 1 instruction when using fast-math...

c++ sse compiler-optimization simd fast-math

Failed to use GNU MIPS builtin functions of vector (SIMD)...

c mips gnu simd intrinsics

C# SoA vs AoS performance...

c# performance amazon-ecs benchmarking simd

Beating or meeting OS X memset (and memset_pattern4)...

c performance optimization assembly simd

incorrect use of `simd_all` to check a compare result on all elements?...

swift simd

AVX2 repack an array of structs of 5 ints to structs of 7 ints, with the extra elements from other a...

c++ simd avx2 avx512

How to disable all SIMD related feature macros in clang?...

clang simd clang++ preprocessor conditional-compilation

Why do SSE instructions preserve the upper 128-bit of the YMM registers?...

performance x86 simd sse avx

How to improve performance of a packed yuv to planar yuv conversion using avx2?...

c++ x86-64 simd avx2

How to best emulate the logical meaning of _mm_slli_si128 (128-bit bit-shift), not _mm_bslli_si128...

c sse simd intrinsics sse2

Logarithm with SSE, or switch to FPU?...

sse simd logarithm natural-logarithm

Fast conversion of 16-bit big-endian to little-endian in ARM...

c++ arm simd neon

Too many SIMD instructions is bad?...

gcc clang simd

Is there a reason Vector64.ExtractMostSignificantBits doesn't use the pext instruction?...

c# .net x86-64 simd bmi

Optimize a separable convolution for SIMD friendly and efficiency...

c image-processing openmp simd ispc

How to use std::simd as input of SIMD intrinsics functions?...

c++ simd intrinsics reinterpret-cast c++23

Pack high bit of every byte in ARM, for 64 bytes like AVX512 vpmovb2m?...

c arm simd arm64 neon

How does SIMD (avx) processing work? for example, if I want 10 32 bit floats how do i fit in a 256 b...

c simd avx

why is my simd vector plus and set slower than using std::transform and std::plus<T> - am i do...

c++ vector vectorization simd avx

SSE4.1 slower than SSE3 on 4x4 matrix multiplication?...

c++ matrix simd sse matmul

Why does _mm256_unpacklo "jump" a double-word and where does it says so in the documentati...

c++ simd intrinsics avx2

Does SSE/AVX provide a means of determining if a result was rounded up?...

x86 rounding sse simd avx

Are SIMD and VLIW instructions the same thing?...

x86 cpu-architecture simd instruction-set vliw

SIMD load across memory boundary doesn't cause segfault?...

c++ segmentation-fault undefined-behavior simd intrinsics

Best way to mask a single bit in AVX2?...

c x86 simd avx avx2

Do all processors supporting AVX2 support F16C?...

x86 x86-64 simd avx2 half-precision-float