assembly x86 inline-assembly

How can I copy a string from a source to a destination using inline assembly?


I'm new to assembly and I'm trying to copy a string from an input parameter, const char* source, into another string given by the input parameter char* destination. I have to do it with x86 inline assembly, and here is my code:

Note: volatile marks that the variable/code region can change unexpectedly from an external source.

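void samestring(const char* start, char* end) {
    asm volatile (
        "mov %[src], %%rsi\n"
        "mov %[dest], %%rdi\n"
        "xor %%al, %%al\n"

        "copy_loop:\n"
        "movb (%%rsi), %%dl\n"
        "movb %%dl, (%%rdi)\n"
        "inc %%rsi\n"
        "inc %%rdi\n"
        "cmpb $0, %%dl\n"
        "jne copy_loop\n"

        :
        : [src] "m" (start), [dest] "m" (end)
        : "memory", "%rsi", "%rdi", "%rax", "%rdx"
        );
}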

I found this code in a Reddit post about a similar problem. Since I'm new to assembly, I don't know whether this method is efficient or whether there are ways to improve it, so I'd like to ask assembly experts what I can and should change in the code above to make it faster.

Any help would be greatly appreciated.


Solution

  • That's hilariously inefficient, including the way it gets operands into the asm statement, but also the loop itself copying 1 byte at a time.

    If you care about efficiency for x86-64, you should be using SSE2 to load and check 16 bytes at a time like glibc's hand-written asm for strcpy. (Or AVX2 for 32 bytes). https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/strcpy.S.html - note that it has to reach an alignment boundary first, e.g. by checking that the pointer isn't in the last 16 bytes of a page and then doing one unaligned vector load, as with strlen.

    That is, unless you're optimizing for string lengths of maybe 0 to 5 bytes, without caring at all about performance for long strings. With AVX-512 masked stores (efficient on Intel, very slow on AMD Zen 4), vectors might be an efficient way to handle even short strings, with no risk of branch mispredicts based on different short lengths, since every string shorter than 32 bytes branches the same way.
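As a rough illustration of the 16-bytes-at-a-time idea, here is a minimal sketch using SSE2 intrinsics in C rather than hand-written asm. It is not glibc's implementation: the function name sse2_strcpy_sketch is made up for this example, and it reaches the alignment boundary with a simple byte loop instead of glibc's page-offset check. It assumes GCC or Clang on x86-64 (for __builtin_ctz). Like hand-written asm versions, the aligned load can read a few bytes past the terminator inside the same 16-byte block; that cannot fault because an aligned 16-byte block never crosses a page, but it is outside what ISO C strictly guarantees.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies src (including the terminating 0) to dst. */
char *sse2_strcpy_sketch(char *dst, const char *src) {
    char *d = dst;

    /* Simplification: copy byte by byte until src is 16-byte aligned. */
    while ((uintptr_t)src & 15) {
        if ((*d++ = *src++) == '\0')
            return dst;
    }

    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        /* Aligned 16-byte load: stays within one page. */
        __m128i v = _mm_load_si128((const __m128i *)src);
        /* One bit per byte of v that equals 0. */
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask != 0) {
            /* Terminator found: copy only up to and including it. */
            size_t n = (size_t)__builtin_ctz(mask) + 1;
            memcpy(d, src, n);
            return dst;
        }
        /* No terminator in this block: store all 16 bytes (dst may be unaligned). */
        _mm_storeu_si128((__m128i *)d, v);
        src += 16;
        d += 16;
    }
}
```

The destination only ever receives strlen(src) + 1 bytes, so the usual strcpy buffer-size requirement is unchanged.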


    Inline asm details

    This forces the compiler to store the pointers to memory ("m" constraint) so the asm template can reload them, instead of asking for them in "+S" (RSI) and "+D" (RDI) registers, or better the compiler's choice of registers with [src] "+r"(source) etc.

    It also zeros AL inefficiently for no reason, and has a false dependency on RDX by loading with movb instead of movzbl (%[src]), %%edx (How to load a single byte from address in assembly)

    test %dl, %dl is a more efficient way to set FLAGS than cmpb $0, %dl.

    Other than that, the loop itself is naive but not too bad if you want to keep it simple as a beginner exercise and only copy 1 byte at a time.
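Putting the points above together, a byte-at-a-time version might look like the sketch below. It assumes GNU C inline asm (GCC or Clang) in the default AT&T syntax on x86-64; the function name copy_string_asm is made up for this example. The pointers are "+r" read-write operands so the compiler picks the registers, the load is a zero-extending movzbl, and test replaces cmp with 0. It also uses a numeric local label (1: / 1b) instead of a named one, since a named label can be defined twice and fail to assemble if the compiler inlines or duplicates the asm statement.

```c
/* Hypothetical helper: copies src (including the terminating 0) to dest. */
void copy_string_asm(const char *src, char *dest) {
    __asm__ volatile (
        "1:\n\t"
        "movzbl (%[src]), %%edx\n\t"   /* zero-extending byte load into EDX */
        "movb   %%dl, (%[dest])\n\t"   /* store the byte, terminator included */
        "inc    %[src]\n\t"
        "inc    %[dest]\n\t"
        "testb  %%dl, %%dl\n\t"        /* set FLAGS from the byte just copied */
        "jnz    1b\n\t"                /* loop until the 0 byte has been copied */
        : [src] "+r" (src), [dest] "+r" (dest)  /* pointers are modified by the asm */
        :
        : "memory", "edx", "cc"        /* reads/writes memory, clobbers EDX and FLAGS */
    );
}
```

A call such as copy_string_asm("hello", buf) with a large enough buf leaves "hello" in buf. This is still one byte per iteration, so it only addresses the operand-passing and instruction-choice issues, not throughput on long strings.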