Search code examples
Is it really efficient to use Karatsuba algorithm in 64-bit x 64-bit multiplication?...


c++ performance parallel-processing simd avx2

Fastest way to multiply an array of int64_t?...


c vectorization multiplication avx avx2

How to align/rotate a 256 bit vector in AVX2?...


rust simd intrinsics avx avx2

Fast __m256i bit operations - find or clear highest or lowest set bit...


x86 bit-manipulation simd avx avx2

How to force gcc to use avx2 for copying a 32-byte struct with shared between threads?...


c assembly x86-64 avx avx2

Transform random integers into range [min,max] without branching...


c++ bit-manipulation simd avx2 branchless

Unpack 12-bit data quickly (where the nibbles aren't contiguous; how to shuffle nibbles?)...


c# c++ avx avx2 pixelformat

x86-64 SIMD mechanism to "compare" 8-bit unsigned integers, giving a vector of +1 / 0 / -1...


simd avx avx2 avx512

How to chain avx2 intrinsics efficiently to perform chain of arithmetic operations?...


gcc optimization vectorization intrinsics avx2

Am I missing a target-feature for AVX512 when I compile my Rust code?...


rust simd rust-cargo avx2 avx512

SSE Vector Comparison with Epsilon...


c optimization sse avx avx2

Efficient (on Ryzen) way to extract the odd elements of a __m256 into a __m128?...


c++ vectorization x86-64 sse avx2

How to pack +-1 signs of 8 packed 32-bit integers (in an __m256i) into bytes of a 64-bit integer?...


c++ performance simd intrinsics avx2

Seg fault while using _mm256_i64gather_pd...


c++ intrinsics avx avx2

perf report shows this function "__memset_avx2_unaligned_erms" has overhead. does this mea...


c++ profiling avx perf avx2

SIMD bit reordering of packed 12-bit integer array...


c simd neon avx2 pixelformat

AVX2 code cannot be faster than gcc base optimization...


c++ performance x86-64 micro-optimization avx2

Xcode Apple Clang enable avx512...


xcode clang avx avx2 avx512

What's the fastest way to perform an arbitrary 128/256/512 bit permutation using SIMD instructio...


c++ assembly sse avx avx2

SIMD Intrinsics AVX. Tried to use _mm256_mullo_epi64. But got 0xC000001D: Illegal Instruction except...


c++ exception simd avx avx2

Disabling AVX2 in CPU for testing purposes...


testing x86 avx instruction-set avx2

How to implement an efficient _mm256_madd_epi8 dot-products of groups of four i8 elements?...


c++ x86 simd intrinsics avx2

Shifting values of avx2 packed single vector...


c++ performance avx2

How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)...


c x86 simd intrinsics avx2

Convert 128 bit AVX register with 8-bit elements to two 256 bit registers with 32-bit elements...


performance x86 simd avx avx2

Does Zen 4 core have 48 flops per cycle for 32-bit precision fp?...


performance x86-64 cpu-architecture avx2 amd-processor

How to gather arbitrary indexes in VCL with AVX2 enabled...


c++ x86 vectorization avx2 vector-class-library

Fastest way to implement _mm256_mullo_epi4 using AVX2...


c x86-64 intrinsics avx avx2

What is the fastest way to calculate the logical_and (&&) between elements of two __m256i va...


c++ simd avx avx2 logical-and

How to load 128bit data to ymm register in assembly?...


assembly x86 avx avx2
