This is the same implementation experiment as in "memset movq giving segfault". I've been printing out memset's result, and it seems to print only the changed bytes, not the rest of the string.
experimentMemset: #memset(void *ptr, int value, size_t num)
movq %rdi, %rax #sets rax to the first pointer, to return later
.loop:
cmp $0, %edx #see if num has reached limit
je .end
movq %rsi, (%rdi) #copies value into rdi
inc %rdi #increments pointer to traverse string
subl $1, %edx #decrements count
jmp .loop
.end:
ret
int main(void) {
char str[] = "almost every programmer should know memset!";
printf("MEMSET\n");
my_memset(str, '-', 6);
printf("%s\n", str);
}
my output: ------
correct output from cplusplus.com: ------ every programmer should know memset!
movq is storing the high zero bytes of int value, not just the low byte. Those zero bytes terminate the C string. It also writes past the end of the ptr+length region your caller passed!
Use mov %sil, (%rdi) to store 1 byte.
(In fact you're storing 8 bytes with movq, including the high 4 bytes that according to the calling convention are allowed to contain garbage because they're not part of the 32-bit int value. With this caller they'll also be zero, though.)
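To see why the output stops after six dashes, the effect of the 8-byte store can be mimicked in C. This is only a sketch: buggy_memset_model is a hypothetical name, the bytes are written explicitly in x86's little-endian order, and it assumes that value fits in one byte and that the upper half of RSI is zero, as it happens to be with this caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the buggy loop: each iteration stores 8 bytes
 * (the low byte of value followed by 7 zero bytes), then advances the
 * pointer by only 1.  Assumes 0 <= value <= 255 and a zero upper half
 * of RSI.  The caller must provide at least num + 7 writable bytes,
 * which is exactly the overrun the real code causes. */
static void *buggy_memset_model(void *ptr, int value, size_t num)
{
    unsigned char *p = ptr;
    unsigned char qword[8] = { (unsigned char)value, 0, 0, 0, 0, 0, 0, 0 };

    assert(ptr != NULL);
    for (size_t i = 0; i < num; i++)
        memcpy(p + i, qword, sizeof qword);   /* movq %rsi, (%rdi) */
    return ptr;
}
```

Called on the question's string with '-' and 6, this leaves dashes at offsets 0 to 5, zero bytes at offsets 6 to 12, and the original text from offset 13 onward, so printf("%s") stops right after the dashes.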
You could have detected this by examining memory contents with a debugger or a better test harness. Do that next time. A better caller for debugging would have used write or fwrite to print the full buffer, and you could pipe that into hexdump -C. Or just use GDB's x command to dump bytes of memory.
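As one possible shape for such a test harness, here is a small C helper that renders every byte of a buffer as hex, so embedded zero bytes stay visible. The name format_hex and its interface are made up for this sketch; they are not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical debugging helper: write each byte of buf into out as two
 * hex digits plus a space, without stopping at zero bytes.  Stops early
 * if out is too small.  Returns the number of characters written, not
 * counting the terminating '\0'. */
static size_t format_hex(char *out, size_t out_size, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t used = 0;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t i = 0; i < len && used + 3 < out_size; i++)
        used += (size_t)snprintf(out + used, out_size - used, "%02x ", p[i]);
    return used;
}
```

After calling the function under test, format the whole array (sizeof str bytes, not strlen(str)) and print the result with fputs. Any zero bytes written into the middle of the string then show up as 00.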
You only check %edx, the low 4 bytes of size_t num in %rdx. If your caller asks you to set exactly 4GiB of memory, you'll return without storing anything.
You can make the loop more compact by putting the conditional branch at the bottom. As for the count, you could change the declaration to unsigned num, or you could fix your code to test all 64 bits of %rdx.
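The truncated count check can be illustrated in C. The two function names below are hypothetical. They only model the comparison the assembly performs, using fixed-width types so the example does not depend on the host's size_t.

```c
#include <stdint.h>

/* Models cmp $0, %edx: only the low 32 bits of the count are examined. */
static int buggy_count_is_zero(uint64_t num)
{
    return (uint32_t)num == 0;
}

/* Models test %rdx, %rdx: all 64 bits of the count are examined. */
static int fixed_count_is_zero(uint64_t num)
{
    return num == 0;
}
```

With num equal to 4GiB (1 shifted left by 32), the first check reports zero and the loop exits immediately, while the second correctly sees a non-zero count.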
.globl experimentMemset
experimentMemset: #memset(void *ptr, int value, size_t num)
movq %rdi, %rax #sets rax to the first pointer, to return later
test %rdx, %rdx # special case: size = 0, loop runs zero times
jz .Lend
.Lloop: # do{
mov %sil, (%rdi) # store the low byte of int value
inc %rdi # ++ptr
dec %rdx
jnz .Lloop # }while(--count);
.Lend:
ret
It's not even any more instructions: I just pulled the cmp/jcc out of the loop to make it a skip-the-loop check, and turned the jmp at the bottom into a jcc that reads the flags set by dec.
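For comparison, the same loop shape written in C looks roughly like this. It is a sketch of the structure only (byte_memset is a made-up name), with comments mapping each line to the corresponding instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Byte-at-a-time memset with the loop condition at the bottom. */
static void *byte_memset(void *ptr, int value, size_t num)
{
    unsigned char *p = ptr;

    if (num == 0)                      /* test %rdx, %rdx ; jz .Lend */
        return ptr;
    assert(ptr != NULL);
    do {
        *p++ = (unsigned char)value;   /* mov %sil, (%rdi) ; inc %rdi */
    } while (--num != 0);              /* dec %rdx ; jnz .Lloop */
    return ptr;                        /* original pointer, as in %rax */
}
```

The up-front check runs once, so the loop body itself contains a single branch per iteration.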
Of course storing 1 byte at a time is very inefficient, even if we optimize the loop so more CPUs can run it at 1 iteration per clock. For medium-sized arrays hot in cache, modern CPUs can go 32 to 64 times faster using AVX or AVX512 stores. They can go close to that fast for aligned buffers with the rep stosb string instruction, on CPUs that have the ERMSB feature. Yes, x86 has a single instruction that implements memset!
(Or for wider patterns, wmemset with rep stosd. On CPUs without ERMSB but with fast strings (PPro and later, before IvyBridge), rep stosd or stosq is faster, so you might imul $0x01010101, %esi, %eax to broadcast the low byte, assuming the upper bytes of %esi are already zero.)
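The multiply trick works because 0x01010101 has a 1 in each byte position, so the product places a copy of the byte in every lane. A minimal C sketch follows. The name broadcast_byte32 is hypothetical, and unlike the bare imul it masks to the low byte first.

```c
#include <stdint.h>

/* Replicate the low byte of value into all four bytes of a 32-bit word.
 * The cast to unsigned char discards any upper bits of value, which the
 * bare imul on %esi would not do. */
static uint32_t broadcast_byte32(int value)
{
    return (uint32_t)(unsigned char)value * UINT32_C(0x01010101);
}
```

For '-' (0x2d) this gives 0x2d2d2d2d, the dword pattern that rep stosd would then store repeatedly.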
# slowish for small or misaligned buffers
# but probably still better than a byte loop for buffers larger than maybe 16 bytes
.globl memset_ermsb
memset_ermsb: #memset(void *ptr, int value, size_t num)
mov %rdx, %rcx # count = num
mov %rdi, %rdx # save ptr: rep stosb advances RDI, and RDX is free now
mov %esi, %eax # AL = char to set
rep stosb # destination = RDI
mov %rdx, %rax # return the original ptr, like memset
ret
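If you'd rather call rep stosb from C than from a separate assembly file, GNU-style inline asm is one way to do it. This is a hedged sketch, not a tuned implementation: memset_rep_stosb is a made-up name, the asm path assumes GCC or Clang targeting x86-64 (where the ABI guarantees the direction flag is clear), and other targets fall back to a plain byte loop.

```c
#include <assert.h>
#include <stddef.h>

static void *memset_rep_stosb(void *ptr, int value, size_t num)
{
    assert(ptr != NULL || num == 0);
#if defined(__GNUC__) && defined(__x86_64__)
    void *dst = ptr;                 /* RDI is modified by the instruction */
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(num)   /* RDI = dest, RCX = count */
                     : "a"(value)             /* AL = byte to store */
                     : "memory");
#else
    /* Portable fallback for other compilers or architectures. */
    unsigned char *p = ptr;
    while (num--)
        *p++ = (unsigned char)value;
#endif
    return ptr;                      /* return the original pointer */
}
```

Keeping the original pointer in a separate variable is what lets the function return it after the instruction has advanced RDI.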
Real memset implementations use SIMD loops because that's faster for small or misaligned buffers. Much has been written about optimizing memset / memcpy. Glibc's implementations are pretty clever and a good example.
Kernel code can't use FPU/SIMD easily, so rep stos memset and rep movsb memcpy do get used in real life in the Linux kernel.