SIMD Examples and Free Source Code

Horizontal XOR in AVX...

c++ assembly x86 simd avx

Divide 8-bit integers by 4 (or shift) using SSE...

c++ x86 sse simd intrinsics

How to achieve peak flop throughput for FMA when using input data (while maintaining the required ro...

c++ performance x86 compiler-optimization simd

Which operations in numpy use SIMD?...

numpy simd

SIMD intrinsics: aligned operation different than unaligned?...

c++ x86 simd intrinsics

inlining failed in call to always_inline ‘_mm_mullo_epi32’: target specific option mismatch...

c cmake x86 sse simd

Fastest Implementation of the Natural Exponential Function Using SSE...

c optimization vectorization sse simd

Avoid Frequency Scaling for SIMD FMA Performance...

c++ performance x86 cpu simd

How to simulate pcmpgtq on sse2?...

assembly sse simd sse2 sse4

What is the most efficient way to do unsigned 64 bit comparison on SSE2?...

assembly sse simd sse2

Using a variable to index a simd vector with _mm256_extract_epi32() intrinsic...

simd intrinsics avx avx2

Modulo on ARM SIMD Aarch64 (NEON)...

c assembly simd arm64

Optimal instruction sequence for AVX512 gather of 4D vectors...

c++ vectorization intel simd avx512

Set Last Value in __m128 vector register...

c++ simd sse avx

Is there anything more I need to do before using SSE instructions?...

assembly x86 simd sse avx

Does browser JavaScript allow for SIMD or Vectorized operations?...

javascript matrix vector vectorization simd

Visual Studio not recognizing __AVX2__ or __AVX__...

c++ visual-c++ cmake macros simd

Understanding throughput of simd sum implementation x86...

x86 simd

print a __m128i variable...

c assembly sse simd intrinsics

How to load uint8_t "as" 32 bits integer efficiently into a SIMD register?...

c++ simd avx512

Extract icons from exe in Rust?...

windows winapi rust simd bevy

Dot-product groups of 4 bytes against 4 small constants, over an array of bytes (efficiently using S...

c# c assembly masm simd

Is my understanding of AoS vs SoA advantages/disadvantages correct?...

caching memory sse simd data-oriented-design

How to solve the 32-byte-alignment issue for AVX load/store operations?...

c++ sse simd memory-alignment avx

AVX2 vectorization for code similar to prefix sum (decrement by count of preceding matches in short ...

simd avx bitmask avx2 prefix-sum

Can AVX2 be used to implement faster processing of LZCNT on a word array?...

x86 simd avx micro-optimization avx2

Dot product performance with SSE instructions: is DPPS worth using?...

assembly x86 simd sse dot-product

simd find first element greater than x...

c++ simd avx512

Reducing NEON vector with variable amounts of bits in each element into a single 32-bit value (conca...

c++ bit-manipulation simd arm64 neon

Why does GCC generate code that conditionally executes a SIMD implementation?...

c++ gcc simd auto-vectorization