assembly x86 stack-memory fasm stack-frame

How to avoid using PUSH without POP?

I'm currently writing x86 assembly (FASM) by hand and one typical mistake I often make is to push an argument on the stack, but return before the pop is executed.

This causes the stack offset to change for the caller, which will make the program crash.

This is a rough example to demonstrate it:

proc MyFunction
    ; A loop:
    mov     ecx, 100
.loop:
    push    ecx

    ; ==== loop content
    ...
    ; Somewhere, the decision is made to return, not just to exit the loop
    jmp    .ret
    ...
    ; ==== loop content

    pop     ecx
    loop    .loop

.ret:
    ret
endp

Now, the obvious answer is to pop the proper number of elements off the stack, before issuing a ret. However, it's easy to overlook something in 1000+ lines of handcrafted assembly.

I was also thinking about using pushad / popad always, but I'm not sure what the convention is for that.

Question: Is there any pattern that I could follow to avoid this issue?

Solution

Normally don't use push/pop inside loops; use mov like a compiler would so you're not moving ESP around unnecessarily. (That can lead to extra stack-sync uops if/when you reference ESP explicitly for other locals.)
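As a rough sketch of that advice, the loop from the question could reserve one stack slot up front and spill the counter with mov. The 32-bit Linux ELF wrapper, the plain labels (instead of the proc macro), the name my_function, and the early-exit condition are all assumptions made only for illustration.

```asm
format ELF executable 3                 ; assumption: 32-bit Linux target
entry start

segment readable executable

my_function:                            ; hypothetical name
        sub     esp, 4                  ; reserve one dword for the loop counter
        mov     ecx, 100
.loop:
        mov     [esp], ecx              ; spill with mov: ESP does not move
        ; ==== loop content, free to clobber ECX
        xor     ecx, ecx                ; stand-in for real work
        cmp     dword [esp], 50         ; arbitrary early-return condition
        je      .ret                    ; safe: ESP is the same on every path
        ; ==== loop content
        mov     ecx, [esp]              ; reload the counter
        dec     ecx
        jnz     .loop
.ret:
        add     esp, 4                  ; the only place the stack is adjusted back
        ret

start:
        call    my_function
        xor     ebx, ebx                ; exit status 0
        mov     eax, 1                  ; sys_exit
        int     0x80
```

Because ESP only changes at the top and at .ret, any jump to .ret from inside the loop leaves the stack balanced.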

Or in this case, just pick a different register for your two different loops, or fully keep the outer loop counter in memory after reserving some space. (sub dword [esp], 1 / jnz .outer_loop. Or [ebp-4] if you set up EBP as a frame pointer instead of just using it as another call-preserved register.)
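A minimal sketch of the counter-in-memory variant, assuming a 32-bit Linux ELF target and made-up loop bounds. The inner loop gets ECX to itself, and the outer counter never occupies a register.

```asm
format ELF executable 3                 ; assumption: 32-bit Linux target
entry start

segment readable executable

my_function:                            ; hypothetical name
        sub     esp, 4                  ; reserve space for the outer counter
        mov     dword [esp], 100        ; outer counter lives at [esp]
        xor     eax, eax                ; EAX counts inner iterations
.outer_loop:
        mov     ecx, 10                 ; ECX belongs to the inner loop only
.inner_loop:
        inc     eax                     ; stand-in for real work
        dec     ecx
        jnz     .inner_loop
        sub     dword [esp], 1          ; decrement the counter in memory
        jnz     .outer_loop
        add     esp, 4                  ; release the slot
        ret

start:
        call    my_function
        xor     ebx, ebx                ; exit status 0
        mov     eax, 1                  ; sys_exit
        int     0x80
```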

Spilling/reloading a register around something inside a loop is inefficient. Your first step in freeing up registers should be to keep read-only things in memory, if they're not needed extremely often. For example, an outer loop like inc edx / cmp edx, [esp+12] / jbe .outer_loop keeps its upper bound in memory and only ever reads it, which avoids a store/reload. Only keep mutable things in memory when you run out of registers, and then of course prefer things that aren't changed often.
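To illustrate the read-only case, here is a sketch in which the loop bound stays in memory as a stack argument and is only compared against. The function name count_up, the cdecl-style argument at [esp+4], and the bounds are assumptions, as is the 32-bit Linux ELF wrapper.

```asm
format ELF executable 3                 ; assumption: 32-bit Linux target
entry start

segment readable executable

; count_up(bound): hypothetical function, argument at [esp+4].
; Returns the number of inner iterations in EAX.
count_up:
        xor     eax, eax
        mov     edx, 1                  ; outer counter stays in EDX
.outer_loop:
        mov     ecx, 3                  ; inner loop uses ECX
.inner_loop:
        inc     eax                     ; stand-in for real work
        dec     ecx
        jnz     .inner_loop
        inc     edx
        cmp     edx, [esp+4]            ; bound is read from memory, never stored
        jbe     .outer_loop
        ret                             ; ESP never moved, so [esp+4] stayed valid

start:
        push    5                       ; bound
        call    count_up
        add     esp, 4                  ; caller removes the argument
        xor     ebx, ebx                ; exit status 0
        mov     eax, 1                  ; sys_exit
        int     0x80
```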

In compiler-generated code, you'll normally only see pushes in the prologue, and pops along paths that lead to a ret. That makes it easy to match them up. If you need to save another call-preserved register for use inside the function, or reserve more stack space for locals, you change the sequence of pushes at the top of the function, and then change the epilogue in the return path(s).
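The compiler-style layout might look like the following sketch: every push sits in the prologue, and every return path goes through one epilogue that undoes it in reverse order. The registers saved, the amount of local space, and the early-return condition are illustrative assumptions. The 32-bit Linux ELF wrapper is there only to make the example stand alone.

```asm
format ELF executable 3                 ; assumption: 32-bit Linux target
entry start

segment readable executable

my_function:                            ; hypothetical name
        ; prologue: all pushes and stack reservation happen here
        push    ebx                     ; call-preserved registers we want to use
        push    esi
        sub     esp, 8                  ; two dwords of locals at [esp], [esp+4]

        mov     ebx, 100
.loop:
        mov     [esp], ebx              ; locals are accessed with mov
        cmp     ebx, 50                 ; arbitrary early-return condition
        je      .epilogue               ; early return uses the shared cleanup
        dec     ebx
        jnz     .loop

.epilogue:
        ; epilogue: mirror image of the prologue
        add     esp, 8
        pop     esi
        pop     ebx
        ret

start:
        call    my_function
        xor     ebx, ebx                ; exit status 0
        mov     eax, 1                  ; sys_exit
        int     0x80
```

Adding another saved register then means touching exactly two places: one more push at the top and one more pop in the epilogue.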

(You can have more than one way out of a function. Especially if there's not much cleanup needed, tail duplication can be better than a jmp to the other copy of the epilogue.)
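A small sketch of tail duplication, assuming a function whose only cleanup is restoring one register: the early-return path carries its own copy of the epilogue instead of jumping to the one at the bottom. Names, return values, and the exit condition are made up for the example.

```asm
format ELF executable 3                 ; assumption: 32-bit Linux target
entry start

segment readable executable

my_function:                            ; hypothetical name
        push    ebx                     ; prologue: one saved register
        mov     ebx, 100
.loop:
        cmp     ebx, 50                 ; arbitrary early-return condition
        jne     .next
        mov     eax, 1                  ; early return: duplicated epilogue
        pop     ebx
        ret
.next:
        dec     ebx
        jnz     .loop
        xor     eax, eax                ; normal return: the other copy
        pop     ebx
        ret

start:
        call    my_function
        xor     ebx, ebx                ; exit status 0
        mov     eax, 1                  ; sys_exit
        int     0x80
```

Each ret is still directly preceded by pops that match the prologue, so the pairing stays easy to check by eye.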

You don't have to be as rigidly disciplined (or braindead) as a compiler; after all, you're writing by hand in asm to get better performance. (Right? Otherwise, just let a compiler do the micro-optimization for you in generating "thousands of lines" of asm! Medium to large amounts of code are where compilers really shine in their ability to quickly analyze data flow and make pretty decent code.)

So you can, for example, use the asm stack as a stack data structure, something you can't convince a compiler to do. (The question "Using the callstack to implement a stack data structure in C?" shows an unsafe attempt, though.) That means using push and pop directly on your data, with "empty" detection via a pointer compare. In that case you'd want to be using EBP as a frame pointer, if you have any other need for stack memory.
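One possible sketch of that idea is a decimal conversion that pushes digits as they are produced and pops them back in printing order. The function to_decimal, its cdecl-style arguments, the buffer, and the 32-bit Linux system calls are all assumptions for the example. EBP is the frame pointer, so the arguments stay addressable while ESP moves, and the data stack is empty once ESP is back at the position it had after the prologue.

```asm
format ELF executable 3                 ; assumption: 32-bit Linux target
entry start

segment readable executable

; to_decimal(value, buf): hypothetical function.
; Writes the decimal digits of value to buf, returns the length in EAX.
to_decimal:
        push    ebp
        mov     ebp, esp                ; args at [ebp+8] and [ebp+12]
        push    edi                     ; saved register lives at [ebp-4]
        mov     eax, [ebp+8]            ; value
        mov     edi, [ebp+12]           ; destination buffer
        mov     ecx, 10
.push_digit:
        xor     edx, edx
        div     ecx                     ; EAX = quotient, EDX = remainder
        push    edx                     ; push one digit, least significant first
        test    eax, eax
        jnz     .push_digit
        lea     edx, [ebp-4]            ; bottom of our data stack
.pop_digit:
        pop     eax                     ; most significant digit comes off first
        add     al, '0'
        mov     [edi], al
        inc     edi
        cmp     esp, edx                ; pointer compare: is the stack empty?
        jne     .pop_digit
        mov     eax, edi
        sub     eax, [ebp+12]           ; length = end - start
        pop     edi                     ; ESP is back at [ebp-4] here
        pop     ebp
        ret

start:
        push    buffer
        push    1234                    ; value to convert
        call    to_decimal
        add     esp, 8
        mov     edx, eax                ; length
        mov     ecx, buffer
        mov     ebx, 1                  ; stdout
        mov     eax, 4                  ; sys_write
        int     0x80
        xor     ebx, ebx                ; exit status 0
        mov     eax, 1                  ; sys_exit
        int     0x80

segment readable writeable
buffer  rb 16
```

The number of pushes depends on the input, which is why the pop loop compares ESP against a fixed bottom pointer instead of counting.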