assembly x86 reverse-engineering strlen

REPNZ SCAS Assembly Instruction Specifics


I am trying to reverse engineer a binary, and the following instruction is confusing me. Can anyone clarify exactly what it does?

=>0x804854e:    repnz scas al,BYTE PTR es:[edi]
  0x8048550:    not    ecx

Where:

EAX: 0x0
ECX: 0xffffffff
EDI: 0xbffff3dc ("aaaaaa\n")
ZF:  1

I can see that ECX is decremented by 1 on each iteration and that EDI is incremented along the string. I know it calculates the length of the string, but I'm not sure exactly how it does so, or why "al" is involved.


Solution

  • I'll try to explain it by reversing the code back into C.

    Intel's Instruction Set Reference (Volume 2 of the Software Developer's Manual) is invaluable for this kind of reverse engineering.

    REPNE SCASB

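    REPNZ is another mnemonic for REPNE, and scas al,BYTE PTR es:[edi] is the byte form of SCAS, i.e. SCASB, so the instruction in your disassembly is REPNE SCASB. The logic for REPNE and SCASB combined: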

    while (ecx != 0) {
        temp = al - *(BYTE *)edi;
        SetStatusFlags(temp);
        if (DF == 0)   // DF = Direction Flag
            edi = edi + 1;
        else
            edi = edi - 1;
        ecx = ecx - 1;
        if (ZF == 1) break;
    }
    

    Or more simply:

    while (ecx != 0) {
        ZF = (al == *(BYTE *)edi);
        if (DF == 0)
            edi++;
        else
            edi--;
        ecx--;
        if (ZF) break;
    }
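
    To make the loop above concrete, here is a minimal sketch that emulates it in plain C. The Regs struct and the repne_scasb function are hypothetical names introduced for illustration; they model only the state this instruction touches and are not part of any real emulator or API.

    ```c
    #include <stdint.h>

    /* Hypothetical register file: only the state REPNE SCASB reads or writes. */
    typedef struct {
        uint8_t        al;   /* byte to search for */
        uint32_t       ecx;  /* remaining iteration count */
        const uint8_t *edi;  /* current scan position */
        int            zf;   /* zero flag: 1 when the last compare matched */
        int            df;   /* direction flag: 0 = forward, 1 = backward */
    } Regs;

    /* Emulates REPNE SCASB: scan until AL is found or ECX reaches zero. */
    static void repne_scasb(Regs *r) {
        while (r->ecx != 0) {
            r->zf = (r->al == *r->edi);   /* compare AL with the byte at EDI */
            r->edi += r->df ? -1 : 1;     /* step EDI according to DF */
            r->ecx--;                     /* the count drops even on a match */
            if (r->zf) break;             /* REPNE stops once ZF is set */
        }
    }
    ```

    With the register values from the question (AL = 0, ECX = 0xffffffff, EDI pointing at "aaaaaa\n"), the loop runs eight times, once for each of the seven characters and once for the terminating zero byte, leaving ECX at 0xfffffff7 and ZF set.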
    

    String Length

    However, the above is insufficient to explain how it computes the length of a string. Based on the presence of the not ecx in your question, I'm assuming the snippet belongs to this idiom (or similar) for computing string length using REPNE SCASB:

    sub ecx, ecx
    sub al, al
    not ecx
    cld
    repne scasb
    not ecx
    dec ecx
    

    Translating to C and using our logic from the previous section, we get:

    ecx = (unsigned)-1;
    al = 0;
    DF = 0;
    while (ecx != 0) {
        ZF = (al == *(BYTE *)edi);
        if (DF == 0)
            edi++;
        else
            edi--;
        ecx--;
        if (ZF) break;
    }
    ecx = ~ecx;
    ecx--;
    

    Simplifying using al = 0 (the zero byte that terminates the string) and DF = 0 (scan forward):

    ecx = (unsigned)-1;
    while (ecx != 0) {
        ZF = (0 == *(BYTE *)edi);
        edi++;
        ecx--;
        if (ZF) break;
    }
    ecx = ~ecx;
    ecx--;
    

    Things to note:

    • in two's complement notation, flipping the bits of ecx is equivalent to -1 - ecx.
    • in the loop, ecx is decremented before the loop breaks, so it decrements by length(edi) + 1 in total.
    • in practice ecx never reaches zero before the terminator is found, since the string would have to occupy the entire 32-bit address space.

    So after the loop above, ecx contains -1 - (length(edi) + 1), which is the same as -(length(edi) + 2). Flipping the bits gives length(edi) + 1, and the final decrement gives length(edi).
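
    As a worked check of that arithmetic, the sketch below applies the not/dec pair to a post-scan ECX value. The helper name length_from_ecx is hypothetical, and it assumes ECX started at 0xffffffff and AL was 0, as in the idiom above.

    ```c
    #include <stdint.h>

    /* Hypothetical helper: recover the string length from the value ECX
       holds after REPNE SCASB, assuming ECX started at 0xffffffff. */
    static uint32_t length_from_ecx(uint32_t ecx_after_scan) {
        uint32_t ecx = ~ecx_after_scan;  /* not ecx: -1 - ecx == length + 1 */
        ecx--;                           /* dec ecx: drop the terminator */
        return ecx;
    }
    ```

    For the string in the question, "aaaaaa\n" is 7 bytes long, so the scan consumes 8 bytes and leaves ECX at 0xfffffff7. Flipping the bits gives 0x00000008, and the decrement gives 7.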

    Or rearranging the loop and simplifying:

    const char *s = edi;
    size_t c = (size_t)-1;      // c == -1
    while (*s++ != '\0') c--;   // c == -1 - length(s)
    c = ~c;                     // c == length(s)
    

    And inverting the count:

    size_t c = 0;
    while (*s++ != '\0') c++;
    

    which is the strlen function from C:

    size_t strlen(const char *s) {
        size_t c = 0;
        while (*s++ != '\0') c++;
        return c;
    }
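
    As a sanity check on the derivation, the following sketch places the count-down form (which mirrors the ECX arithmetic) next to the count-up form and compares both against the standard library strlen. The function names strlen_countdown, strlen_countup and forms_agree are hypothetical and chosen only for this illustration.

    ```c
    #include <stddef.h>
    #include <string.h>

    /* Count-down form: starts at all ones and decrements once per
       character, like ECX in the assembly idiom, then flips the bits. */
    static size_t strlen_countdown(const char *s) {
        size_t c = (size_t)-1;
        while (*s++ != '\0') c--;   /* c == -1 - length */
        return ~c;                  /* ~c == length */
    }

    /* Count-up form: the conventional loop. */
    static size_t strlen_countup(const char *s) {
        size_t c = 0;
        while (*s++ != '\0') c++;
        return c;
    }

    /* Returns 1 when both forms agree with the library strlen. */
    static int forms_agree(const char *s) {
        size_t expected = strlen(s);
        return strlen_countdown(s) == expected
            && strlen_countup(s) == expected;
    }
    ```

    Calling forms_agree on a few inputs, including the empty string and the "aaaaaa\n" string from the question, is a quick way to confirm that the two loops compute the same value.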