Tags: assembly, x86-64, att, memset

Result of my implementation of memset only prints the changes, and not the entire result string


This is the same experimental implementation as in my earlier question, "memset movq giving segfault". I've been printing out memset's result, and it seems to print only the changed bytes, not the rest of the string as well.

experimentMemset:   #memset(void *ptr, int value, size_t num)

    movq %rdi, %rax     #sets rax to the first pointer, to return later

.loop:
        cmp $0, %edx    #see if num has reached limit
        je .end                

        movq %rsi, (%rdi)       #copies value into rdi
        inc %rdi                #increments pointer to traverse string
        subl $1, %edx   #decrements count
        jmp .loop
.end:
        ret



int main(void) {

    char str[] = "almost every programmer should know memset!";
    printf("MEMSET\n");
    my_memset(str, '-', 6);
    printf("%s\n", str);

}

my output: ------

correct output from cplusplus.com: ------ every programmer should know memset!


Solution

  • movq stores the zeros in the high bytes of int value, not just the low byte. Those zeros terminate the C string. It also writes past the end of the ptr+length range your caller passed!

    Use mov %sil, (%rdi) to store 1 byte.

    (In fact you're storing 8 bytes with movq, including the high 4 bytes that according to the calling convention are allowed to contain garbage because they're not part of the 32-bit int value. With this caller they'll also be zero, though.)

    You could have detected this by examining memory contents with a debugger or better test harness. Do that next time. A better caller for debugging would have used write or fwrite to print the full buffer, and you could pipe that into hexdump -C. Or just use GDB's x command to dump bytes of memory.
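
To make the effect of the 8-byte store concrete, here is a minimal C sketch that models the buggy loop and dumps the buffer in hex. The names buggy_fill and dump_hex are hypothetical helpers made up for this illustration, and the model assumes the upper 4 bytes of the register are zero, as they happen to be with this caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical C model of the buggy loop: every iteration stores the whole
   8-byte register image, then advances the pointer by only one byte. */
static void buggy_fill(char *p, int value, size_t n) {
    assert(p != NULL);
    unsigned char image[8] = {0};            /* upper 4 bytes assumed zero */
    for (int i = 0; i < 4; i++)              /* little-endian bytes of the int */
        image[i] = (unsigned char)((unsigned)value >> (8 * i));
    while (n--) {
        memcpy(p, image, sizeof image);      /* models movq %rsi, (%rdi) */
        p++;                                 /* models inc %rdi */
    }
}

/* Print every byte in hex so embedded zero bytes stay visible,
   unlike printf("%s"), which stops at the first one. */
static void dump_hex(FILE *out, const char *buf, size_t n) {
    for (size_t i = 0; i < n; i++)
        fprintf(out, "%02x%c", (unsigned char)buf[i], (i + 1 == n) ? '\n' : ' ');
}
```

Calling buggy_fill(str, '-', 6) on the question's string and then dump_hex(stdout, str, sizeof str) shows six 2d bytes followed by seven 00 bytes, after which the original text resumes at "programmer". The first zero byte is what cuts the printf output short.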


  • You only check %edx, the low 4 bytes of size_t num in %rdx. If your caller asks you to set exactly 4GiB of memory, you'll return without storing anything. You could change the declaration to unsigned num, or you could fix your code.


  • You can make the loop more compact by putting the conditional branch at the bottom.

    .globl experimentMemset
    experimentMemset:   #memset(void *ptr, int value, size_t num)
    
        movq %rdi, %rax     #sets rax to the first pointer, to return later
    
        test  %rdx, %rdx    # special case: size = 0, loop runs zero times
        jz    .Lend
    .Lloop:                   # do{
          mov   %sil, (%rdi)     # store the low byte of int value
          inc   %rdi             # ++ptr
          dec   %rdx
          jnz  .Lloop         # }while(--count);
    .Lend:
        ret
    

    It's not even any more instructions: I just pulled the cmp/jcc out of the loop to make it a skip-the-loop check, and turned the jmp at the bottom into a jcc that reads the flags set by dec.
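
For readers who think more easily in C, the same loop shape can be sketched as below. This is only an illustrative rendering, not what a compiler would necessarily emit; byte_fill and low32 are hypothetical names. low32 shows what a 32-bit compare on %edx effectively sees of a 64-bit count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C rendering of the bottom-tested loop: one guard up front, then do/while. */
static void *byte_fill(void *ptr, int value, size_t num) {
    assert(ptr != NULL || num == 0);
    unsigned char *p = ptr;
    if (num != 0) {                       /* test %rdx,%rdx ; jz .Lend */
        do {
            *p++ = (unsigned char)value;  /* mov %sil,(%rdi) ; inc %rdi */
        } while (--num != 0);             /* dec %rdx ; jnz .Lloop */
    }
    return ptr;                           /* original pointer, as memset returns */
}

/* What a 32-bit compare sees: only the low 4 bytes of the count. */
static uint32_t low32(uint64_t num) {
    return (uint32_t)num;
}
```

With a count of exactly 4GiB, low32((uint64_t)1 << 32) is 0, which is why a loop that tests only %edx exits immediately, while the full-width count in byte_fill does not have that problem.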


    Efficiency

    Of course storing 1 byte at a time is very inefficient, even if we optimize the loop so more CPUs can run it at 1 iteration per clock. For medium-sized arrays hot in cache, modern CPUs can go 32 to 64 times faster using AVX or AVX512 stores. They can also go close to that fast for aligned buffers with the rep stosb string instruction, on CPUs that have the ERMSB feature. Yes, x86 has a single instruction that implements memset!

    (Or for wider patterns, wmemset with rep stosd. On CPUs without ERMSB but with fast strings (PPro and later before IvyBridge), rep stosd or stosq is faster, so you might zero-extend the byte and multiply it by 0x01010101 to broadcast it, e.g. movzbl %sil, %eax then imul $0x01010101, %eax, %eax.)
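
The multiply trick for broadcasting a byte can be sketched in C as follows. The function names are hypothetical. Note the cast to unsigned char first: the multiply only replicates the byte correctly when all the upper bits of the input are zero.

```c
#include <assert.h>
#include <stdint.h>

/* Replicate the low byte of value into all 4 bytes of a 32-bit word. */
static uint32_t broadcast_byte32(int value) {
    uint32_t r = (uint32_t)(unsigned char)value * UINT32_C(0x01010101);
    assert((r >> 24) == (r & 0xFF));   /* every byte lane holds the same value */
    return r;
}

/* Same idea for 8 bytes, as a pattern for 8-byte stores such as rep stosq. */
static uint64_t broadcast_byte64(int value) {
    return (uint64_t)(unsigned char)value * UINT64_C(0x0101010101010101);
}
```

For '-' (0x2D) this gives 0x2D2D2D2D and 0x2D2D2D2D2D2D2D2D respectively. No carries occur between byte lanes because each partial product is at most 0xFF.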

    # slowish for small or misaligned buffers
    # but probably still better than a byte loop for buffers larger than maybe 16 bytes
    .globl memset_ermsb
    memset_ermsb:   #memset(void *ptr, int value, size_t num)
    
        mov  %rdx, %rcx            # count = num
        mov  %rdi, %rdx            # save ptr: rep stosb advances RDI
        mov  %esi, %eax            # AL = char to set
        rep  stosb                 # destination = RDI
        mov  %rdx, %rax            # return the original ptr
        ret
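
If you would rather experiment from C, the same instruction can be wrapped with GNU-style extended inline asm. This is a sketch that assumes GCC or Clang targeting x86-64, with a plain byte loop as a fallback elsewhere; fill_rep_stosb is a hypothetical name. The constraints spell out the register roles: "D" is RDI (destination), "c" is RCX (count), and "a" is RAX, whose low byte AL is stored.

```c
#include <assert.h>
#include <stddef.h>

static void *fill_rep_stosb(void *ptr, int value, size_t num) {
    assert(ptr != NULL || num == 0);
    void *ret = ptr;                      /* rep stosb advances RDI, so keep a copy */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile("rep stosb"
                     : "+D"(ptr), "+c"(num)   /* both are modified by the instruction */
                     : "a"(value)             /* AL = byte to store */
                     : "memory");             /* the buffer contents change */
#else
    unsigned char *p = ptr;               /* portable fallback: simple byte loop */
    while (num--)
        *p++ = (unsigned char)value;
#endif
    return ret;                           /* memset returns the original pointer */
}
```

The wrapper relies on the direction flag being clear on function entry, which the x86-64 System V calling convention guarantees.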
    

    Real memset implementations use SIMD loops because that's faster for small or misaligned buffers. Much has been written about optimizing memset / memcpy. Glibc's implementations are pretty clever and a good example.

    Kernel code can't use FPU/SIMD easily, so rep stos memset and rep movsb memcpy do get used in real life in the Linux kernel.