I'd like to fill an array of 4096 bytes (aligned to the 4096-byte boundary) with zeros in amd64 assembly. I'm looking for both portable and single-CPU-type-only solutions.
I know that rep stosq would do the trick, but is there anything faster? MMX? SSE? How much faster is it? How many bytes can be written to memory in a single instruction (without rep)? We can assume that the memory cache is empty. I don't need a fully working function implementation, I just need the basic idea with its crucial assembly instruction.
I've just seen the movdqa instruction which can write 16 bytes at a time. Is it twice as fast as 2 mov instructions of 8 bytes each?
The answer to your question can be found by looking at the source code in the file memset64.asm in Agner Fog's asmlib.
His code has versions for AVX and SSE. From what I can tell, the AVX version uses _mm256_store_ps (vmovaps) for arrays smaller than MemsetCacheLimit. For larger arrays it uses non-temporal stores with _mm256_stream_ps (vmovntps). Several other factors can affect the results; see the code. You could probably get the same performance in most cases from C/C++ using intrinsic functions.
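As a rough illustration of the two kinds of store mentioned above, here is a minimal sketch in C with intrinsics. It is not Agner Fog's code. It uses 16-byte SSE2 stores (movdqa and movntdq) instead of the 32-byte AVX ones, because SSE2 is available on every amd64 CPU. The function names and PAGE_BYTES are made up for this example, and it assumes the buffer is exactly 4096 bytes long and at least 16-byte aligned.

```c
#include <emmintrin.h>  /* SSE2: _mm_store_si128, _mm_stream_si128 */
#include <stddef.h>

/* Hypothetical constant for this sketch: the size of the block to clear. */
enum { PAGE_BYTES = 4096 };

/* Zero 4096 bytes with ordinary aligned 16-byte stores (movdqa).
   The data goes through the cache. p must be 16-byte aligned. */
void zero_page_sse2(void *p)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i *dst = (__m128i *)p;

    /* 4 stores per iteration = 64 bytes, one typical cache line. */
    for (size_t i = 0; i < PAGE_BYTES / sizeof(__m128i); i += 4) {
        _mm_store_si128(dst + i + 0, zero);
        _mm_store_si128(dst + i + 1, zero);
        _mm_store_si128(dst + i + 2, zero);
        _mm_store_si128(dst + i + 3, zero);
    }
}

/* Same loop with non-temporal stores (movntdq), which bypass the cache.
   p must be 16-byte aligned. */
void zero_page_sse2_nt(void *p)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i *dst = (__m128i *)p;

    for (size_t i = 0; i < PAGE_BYTES / sizeof(__m128i); i += 4) {
        _mm_stream_si128(dst + i + 0, zero);
        _mm_stream_si128(dst + i + 1, zero);
        _mm_stream_si128(dst + i + 2, zero);
        _mm_stream_si128(dst + i + 3, zero);
    }

    /* Make the weakly ordered non-temporal stores globally visible
       before any later stores. */
    _mm_sfence();
}
```

Which variant is faster depends on the CPU and on whether the page is read again soon afterwards, so measure on your target before choosing one.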
Note that both the built-in memset function in GCC and the version in glibc were, last I checked, not optimized (which is one reason memset is in asmlib).