intrinsics Examples and Free Source Code

What's the point of _mm_cmpgt_sd and other similar methods?...

x86 sse simd intrinsics

What is the difference between _mm_movehdup_ps and _mm_shuffle_ps in this case?...

x86 sse intrinsics micro-optimization sse3

Loading into Array causes Stack Smashing while having enough space?...

c++ intrinsics avx avx512 stack-smash

How do I efficiently reorder bytes of a __m256i vector (convert int32_t to uint8_t)?...

c++ vectorization simd intrinsics avx2

Compiler errors for GCC (via CUDA) intrinsic functions, but I'm not using any...

c++ gcc compiler-errors cuda intrinsics

Summing vec4[idx[i]] * scalar[i] with YMM vector registers...

c++ simd intrinsics avx2

SSE: shuffle (permutevar) 4x32 integers...

sse simd intrinsics avx

Convert AoS to SoA in C using SIMD...

c arrays struct simd intrinsics

Most efficient way to check if all __m128i components are 0 [using <= SSE4.1 intrinsics]...

c++ integer sse simd intrinsics

Unresolved external symbol __aullshr when optimization is turned off...

c visual-c++ intrinsics bit-fields uefi

Segmentation fault (core dumped) when using avx on an array allocated with new[]...

c++11 codeblocks intrinsics avx

Missing AVX-512 intrinsics for masks?...

c gcc intrinsics icc avx512

__m256 unknown type (clang 5.1/i5 CPU)?...

c++ x86 clang++ intrinsics avx

How does dead code elimination of Math.log() work in JMH sample...

java intrinsics microbenchmark jmh

Computing 8 horizontal sums of eight AVX single-precision floating-point vectors...

optimization intrinsics avx low-level

cuda "rounding modes" of reciprocal functions...

api math cuda intrinsics

Why does _mm_mfence() produce counts for the ALL_LOADS perf event?...

c x86 intrinsics perf papi

How to detect rdtscp support in Visual C++?...

c++ visual-c++ x86 intrinsics rdtsc

What is the difference between loadu and load?...

assembly x86 sse simd intrinsics

unresolved external symbol __mm256_setr_epi64x...

c++ visual-studio-2012 intrinsics avx msvc12

_mm_lfence() time overhead is non deterministic?...

c performance x86 intrinsics rdtsc

How to move double in %rax into particular qword position on %ymm or %zmm? (Kaby Lake or later)...

c++ x86-64 inline-assembly intrinsics avx

FMA instruction showing up as three packed double operations?...

linear-algebra intrinsics perf

Why __m256 instead of 'float' gives more than x8 performance?...

c++ visual-c++ compiler-optimization sse intrinsics

How to floor/int in double using only SSE2?...

c++ simd truncate intrinsics sse2

What's the difference between __popcnt() and _mm_popcnt_u32()?...

x86 sse intrinsics sse4

ARM SVE Left-to-right vs. tree reduction...

arm intrinsics sve

What is the fastest way to convert a large c-array of char8 to short16?...

c++ c intel intrinsics

How do you process exp() with SSE2?...

c++ simd intrinsics sse2 exp

Move an int64_t to the high quadwords of an AVX2 __m256i vector...

c++ x86-64 simd intrinsics avx2