Search code examples
4-way bytewise interleave 4x 16-byte vectors from memory, with AVX512...


x86 x86-64 micro-optimization avx512

Insert a bit in a byte at pos n with Assembly...


assembly arm micro-optimization cortex-m

Why jnz requires 2 cycles to complete in an inner loop...


x86 micro-optimization microbenchmark micro-architecture

Cycles/cost for L1 Cache hit vs. Register on x86?...


performance x86 cpu-architecture cpu-cache micro-optimization

When to use a certain calling convention...


assembly x86 x86-64 calling-convention micro-optimization

Is there a penalty when base+offset is in a different page than the base?...


performance assembly x86 micro-optimization

Why does GCC chose dword movl to copy a long shift count to CL?...


assembly gcc x86-64 micro-optimization

Is there any performance difference in using int versus int8_t...


c types micro-optimization

Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock?...


assembly x86 cpu-cache micro-optimization compare-and-swap

What is the optimal way for reading the contents of a webpage into a string in Java?...


java string optimization inputstream micro-optimization

Why this unnecessary MOVAPD copy in gcc 9.1, in a tiny function...


assembly gcc x86-64 sse micro-optimization

In x86-64 asm: is there a way of optimising two adjacent 32-bit stores / writes to memory if the sou...


assembly optimization x86-64 micro-optimization

Instructions to copy the low byte from an int to a char: Simpler to just do a byte load?...


c assembly x86-64 micro-optimization instructions

Can this MIPS assembly code be simplified?...


string assembly mips micro-optimization simplify

Avoiding AVX-SSE (VEX) Transition Penalties...


assembly x86 sse avx micro-optimization

Is movzbl followed by testl faster than testb?...


performance assembly x86 x86-64 micro-optimization

An implementation of std::atomic_thread_fence(std::memory_order_seq_cst) on x86 without extra perfor...


c++ x86 inline-assembly micro-optimization memory-barriers

What is the fastest way to swap the bytes of an unaligned 64 bit value in memory?...


performance assembly x86-64 endianness micro-optimization

Should combining memory fence for mutex acquire-exchange loop (or queue acquire-load loop) be done o...


arm cpu-architecture micro-optimization memory-barriers

How to instruct MS Visual C++ compiler to use an uninitialized __m512i register...


c++ visual-c++ intrinsics micro-optimization avx512

Do java finals help the compiler create more efficient bytecode?...


java optimization micro-optimization

How can one figure out if a loop is being entered with a 16 byte aligned address in x86-64 assembly?...


assembly optimization x86-64 memory-alignment micro-optimization

fastest way to negate a number...


c++ visual-c++ x86 micro-optimization visual-c++-2012

repz ret: why all the hassle?...


assembly x86 micro-optimization amd-processor branch-prediction

What does `rep ret` mean?...


assembly x86 micro-optimization branch-prediction

Is calling `add` on a memory location faster than calling it on a register and then moving the value...


assembly x86 x86-64 micro-optimization

Fast method to copy memory with translation - ARGB to BGR...


c x86 rgb sse micro-optimization

Impact on performance when having multiple returns...


assembly x86 micro-optimization

80286: Which is the fastest way to multiply by 10?...


assembly x86-16 micro-optimization

What do multiple values or ranges means as the latency for a single instruction?...


performance assembly x86 cpu-architecture micro-optimization
