Assembly x86: LEA and MOVSB changes my Source String?


I'm writing a program in x86 assembly (32-bit Intel, on Windows) for homework. It has to cipher a string, which I process in blocks of two characters. I use EBX as the index into the source string, increasing it by 2 for each block. I haven't reached the ciphering part yet, because I'm stuck on a smaller problem. When both characters of a block are the same, like "AA", the block doesn't need to go through the ciphering process, so I have to copy "AA" to the result string as it is. This is how I do this:

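CipheLoop:
    call VerifyBlock                ; sets caracblock for the current block
    cmp byte [caracblock], 0        ; 0: empty block, end of string
    je End
    cmp byte [caracblock], 1        ; 1: single leftover character
    je AddLastCharacter
    cmp byte [caracblock], 2        ; 2: copy the block unchanged
    je AddNoCiphedBlock
    jmp CipheLoop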

VerifyBlock inspects the current block and sets "caracblock" to a number that describes it: 0 means the block is empty (the string is over); 1 means the block holds only one character (for example, "ABC" ends with a block containing just "C"); 2 means the characters of the block have to be copied as they are (the case described above, or a block that contains a space); 3 means the block needs to be ciphered. Up to this point things work great: the program adds the characters and finishes when expected. However, AddNoCiphedBlock has some unexpected behavior. It looks like this:

AddNoCiphedBlock:
    mov esi, 0
    mov edi, 0
    mov ecx, 2
    lea esi, [sourcestring + ebx]
    lea edi, [resultstring + ebx]
    rep movsb
    add ebx, 2
    jmp CipheLoop

The problem is not so much what it returns (although that isn't what I expected either). The root of the problem is that, for some reason, the source string is altered. If I enter "AA", I get "AA", which is correct. If I enter "AABB", I get "AABB", which is also correct. If I enter "AABBCC", I get "AABBAA". After AddNoCiphedBlock runs, the source string changes to "AABBAA", and it keeps getting worse. This is what happens to the source string through the process:

AABBCC
AABBAA
AABBAA
AABBAABB
AABBAABB
AABBAABBAA

Why is this happening? I'm just copying something from the source! Both my source string and my result string are in the .bss section, declared as "sourcestring resd 1" and "resultstring resd 1". I use _gets to read the source string. I'm trying to give as much explanation and detail as possible, because I can't figure out why it goes wrong like that.


Solution

  • You used gets and buffers only 4 bytes long (resd 1), and you overflow them.

    When your input is 4 characters or longer, the terminating 0 byte lands outside the buffer. (gets stores the data plus a terminator, so 5 bytes total for a 4-character input such as "AABB": the 4 data bytes fill sourcestring: resd 1, and the terminating 0 goes into the first byte of resultstring: resd 1.)

    If they're adjacent, then copying the first 2 bytes of src to dst overwrites the 0 byte of src, because it's also the first byte of dst.
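
    The same reasoning explains the "AABBCC" sequence from the question. The trace below is a sketch that assumes the assembler and linker placed resultstring immediately after sourcestring, which the observed output suggests but the question doesn't confirm:

    ```nasm
    ; Assumed layout: resultstring starts right where sourcestring ends.
    section .bss
    sourcestring  resd 1    ; 4 bytes: offsets 0..3
    resultstring  resd 1    ; 4 bytes: offsets 4..7, i.e. sourcestring+4

    ; Bytes starting at sourcestring (0 = terminator):
    ;   after gets reads "AABBCC":    A A B B C C 0        7 bytes stored, 4 reserved
    ;   ebx=0: "AA" -> src+4..5       A A B B A A 0        source now reads "AABBAA"
    ;   ebx=2: "BB" -> src+6..7       A A B B A A B B      terminator overwritten
    ;   ebx=4: "AA" -> src+8..9       A A B B A A B B A A  past both buffers
    ```

    Each copy goes to resultstring + ebx, which under this assumption is sourcestring + 4 + ebx, so every block is written 4 bytes further along the same memory the source string occupies.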

    Use (much) larger buffers, and/or use a function that takes an upper limit on how many bytes to read (a buffer size), so that long input can't run past the end of the buffer.
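
    A minimal sketch of larger buffers. BUF_SIZE and the value 256 are placeholders chosen for illustration, not something from the question:

    ```nasm
    BUF_SIZE equ 256                ; hypothetical size; pick one that fits your inputs

    section .bss
    sourcestring  resb BUF_SIZE     ; resb reserves bytes; resd 1 reserved only 4
    resultstring  resb BUF_SIZE     ; same size, so a full copy always fits
    ```

    This only moves the limit: gets will still write past the end if the input is BUF_SIZE characters or longer. A size-limited read (the C library's fgets is one such function) is what actually prevents the overflow.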


    How to debug this:

    Once you know a byte is changing that you expect not to change, set a watchpoint on that address in your debugger. Then it will stop at the instruction that changes it.

    In your case, that'll be the terminating 0 byte at the end of "AABB". The address is sourcestring+4 because it comes after the 4 ASCII bytes. Then let it run and you'll see it stop at the rep movsb.

    Also, you might notice while doing this that your buffer is only 4 bytes long so sourcestring+4 is outside the 4 byte buffer, and/or that the address is the same address as resultstring.


    Code review:

    mov esi,0 and mov edi,0 are useless; you overwrite ESI and EDI right afterwards using lea. Also, using rep movsb for a fixed-size 2-byte copy is hilariously overcomplicated. One 2-byte load and one 2-byte store do the job:

        movzx eax,  word [sourcestring + ebx]      ; 2-byte load
        mov   [resultstring + ebx], ax             ; 2-byte store
    

    Or if you insist on movs, use movsw once. (Not rep movsw, so you don't have to set ECX.)
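
    For comparison, a sketch of the movsw version of the question's AddNoCiphedBlock. It reuses the question's labels and assumes the direction flag is clear (DF=0), which is the normal state under the standard calling conventions:

    ```nasm
    AddNoCiphedBlock:
        lea   esi, [sourcestring + ebx]   ; source address of the current block
        lea   edi, [resultstring + ebx]   ; destination address in the result
        movsw                             ; copy 2 bytes from [esi] to [edi]
        add   ebx, 2                      ; advance to the next block
        jmp   CipheLoop
    ```

    movsw also advances ESI and EDI by 2, but since both are recomputed from EBX on every pass, that side effect doesn't matter here.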