Search code examples
assemblyx86-64gnu-assemblerattstrlen

How to traverse a string in assembly until I reach null? (strlen loop)


Right now I'm just figuring out how to even traverse a string. If the code doesn't make sense, it's because I've interpreted some information wrong. At the worst, I don't really know what I'm doing.

strlen:

pushq %rbx
movq %rsi, %rbx


loop:
    cmp $0x00, (%rdi, %rbx)
    je end
    inc %rbx
    jmp loop

end:
    movq %rbx, %rax
    popq %rbx
    ret

PS: There's a reason my title looks like an old man 2nd time on his computer trying to search "how to go to google.com" Superrrr noob here trying to learn a little bit of assembly. I'm trying to implement strlen function for myself.


Solution

  • You just inc %rbx to increment the pointer value. (%rbx) dereferences that register, using its value as a memory address. On x86, every byte has its own address (this property is called "byte addressable"), and addresses are just integers that fit in a register.

    The characters in an ASCII string are all 1 byte wide so incrementing a pointer by 1 moves to the next character in an ASCII string. (This isn't true in the general case of UTF-8 with characters outside the 1..127 range of codepoints, but ASCII is a subset of UTF-8.)


    Terminology: ASCII code 0 is called NUL (one L), not NULL. In C, NULL is a pointer concept. C-style implicit-length strings can be described as 0-terminated or NUL-terminated, but "null-terminated" is misusing the terminology.


    You should pick a different register (one that's call-clobbered) so you don't need to push/pop it around your function. Your code doesn't make any function calls, so there's no need to keep your induction variable in a call-preserved register.

    I didn't find a good simple example in other SO Q&As. They either have 2 branches inside the loop (including one unconditional jmp) like the one I linked in comments, or they waste instructions incrementing a pointer and a counter. Using an indexed addressing mode inside the loop is not terrible, but is less efficient on some CPUs so I'd still recommend doing a pointer increment -> subtract end-start after the loop.

    This is how I'd write a minimal strlen that only checks 1 byte at a time (slow and simple). I kept the loop itself small, and this is IMO a reasonable example of a good way of writing loops in general. Often keeping your code compact makes it easier to understand a function in asm. (Give it a different name than strlen so you can test it without needing gcc -fno-builtin-strlen or whatever.)

    .globl simple_strlen
    simple_strlen:
        lea     -1(%rdi), %rax     # p = start-1 to counteract the first inc
     .Lloop:                       # do {
        inc     %rax                  # ++p
        cmpb    $0, (%rax)
        jne     .Lloop             # }while(*p != 0);
                               # RAX points at the terminating 0 byte = one-past-end of the real data
        sub     %rdi, %rax     # return length = end - start
        ret
    

    The return value of strlen is the array index of the 0 byte = length of the data not including the terminator.

    If you were inlining this manually (because it's just a 3-instruction loop), you'd often just want a pointer to the 0 terminator so you wouldn't bother with the sub crap, just use RAX at the end of the loop.

    Avoiding the offsetting LEA/INC instructions before the first load (which cost 2 cycles of latency before the first cmp) could be done by peeling the first iteration, or with a jmp to enter the loop at the cmp/jne, after the inc. Why are loops always compiled into "do...while" style (tail jump)?.

    Incrementing the pointer with LEA between cmp/jcc (like cmp ; lea 1(%rax), %rax ; jne) could be worse because it defeats macro-fusion of cmp/jcc into a single uop. (Actually, macro-fusion of cmp $imm, (%reg) / jcc doesn't happen on Intel CPUs like Skylake anyway. cmp micro-fuses the memory operand, though. Maybe AMD fuses the cmp/jcc.) Also, you'd leave the loop with RAX 1 higher than you want.

    So it would be just as efficient (on Intel Sandybridge-family) to movzx (aka movzbl) load and zero-extend a byte into %ecx, and test %ecx, %ecx / jnz as the loop condition. But larger code-size.


    Most CPUs will run my loop at 1 iteration per clock cycle. We could maybe get close to 2 bytes per cycle (while still only checking each byte separately) with some loop unrolling.

    Checking 1 byte at a time is about 16x slower for large strings than we could go with SSE2. If you aren't aiming for minimal code size and simplicity, see Why is this code 6.5x slower with optimizations enabled? for a simple SSE2 strlen that uses an XMM register. SSE2 is baseline for x86-64 so you should always use it when it gives a speedup, for stuff that's worth writing by hand in asm.


    Re: your updated question with a buggy port of the implementation from Why does rax and rdi work the same in this situation?

    RDI and RBX both hold pointers. Adding them together doesn't make a valid address! In the code you were trying to port, RCX (the index) is initialized to zero before the loop. But instead of xor %ebx, %ebx, you did mov %rdi, %rbx. Use a debugger to examine register values while you single-step your code.