assembly x86-64 memory-alignment calling-convention stack-pointer

What does "aligning the stack" mean in assembly?

How does stack alignment work in x86-64 assembly? When do you need to align the stack before a function call, and how much do you need to subtract?

I don't understand the purpose of it. I know there are other posts about this, but they weren't clear enough for me. For example:

extern foo
global bar

section .text
bar:
  ;some code...
  sub  rsp, 8     ; Why 8 (I saw this in some posts)? Can it be another value? Why do we need to subtract?
  call foo        ; Do we need to align the stack every time we call a function?
  add  rsp, 8
  ;some code...
  ret

Solution

Addressing is generally byte-based: each unique address refers to one byte. That byte can be the first byte of a word, doubleword, and so on, in which case the larger item is referenced by that byte's address.

With any numbering system, the least significant digit holds the value of the base to the power 0 (the number 1), the next digit the base to the power 1, and the next the base to the power 2. In decimal these are the ones column, the tens column, and the hundreds column; in binary, the ones, twos, fours, and so on. Alignment means the address is evenly divisible by some power of the base, say base to the power n, which also means its n least significant digits are zeros.

You are always "aligned" on a byte boundary, but alignment on a 16-bit boundary means the least significant bit of the address is zero, 32-bit alignment means the two least significant bits are zero, and so on.

0x1234 aligned on both a 16-bit and a 32-bit boundary, but not 64-bit
0x1235 not aligned (byte alignment really isn't a thing)
0x1236 aligned on a 16-bit boundary only
0x1230 four zero bits, so aligned on 16, 32, 64, and 128 BIT boundaries (bits, not bytes): 2, 4, 8, and 16 bytes
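The same check can be written as a bit mask. This is a minimal sketch in C rather than assembly, and is_aligned is a name I made up for illustration; it assumes the alignment is a power of two, given in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is a multiple of align_bytes.
   align_bytes is assumed to be a power of two (1, 2, 4, 8, 16, ...). */
bool is_aligned(uint64_t addr, uint64_t align_bytes)
{
    assert(align_bytes != 0 && (align_bytes & (align_bytes - 1)) == 0);
    /* For a power of two, (align_bytes - 1) is a mask of exactly the
       low bits that must all be zero. */
    return (addr & (align_bytes - 1)) == 0;
}
```

With this, is_aligned(0x1234, 4) is true and is_aligned(0x1234, 8) is false, matching the first line of the list above.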

The why is performance. All memories have a fixed width, and so do data buses. You can't magically add or remove wires in the logic once it is implemented; there is a physical limit. A design can choose not to use all of the wires, but it can't add any.

So, while the x86 buses are wider, let's say you had a 32-bit-wide data bus and a 32-bit-wide memory (think cache, but also DRAM, although in general we don't access DRAM directly).

Say I want to save the 16-bit value 0xAABB to address 0x1001 on a little-endian machine: 0x1001 gets 0xBB and 0x1002 gets 0xAA. With a 32-bit data bus and a 32-bit memory on the far side of it, I could move those 16 bits in one transaction, if I designed the bus for this, by writing 0xXXAABBXX to address 0x1000 with a byte lane mask of 0b0110. The address tells the memory controller to use the 32 bits of memory associated with the BYTE-based address 0x1000, and the byte lane mask tells the controller to save only the middle two bytes; the outer two are don't-cares.

The memory generally has a fixed width, so all of its transactions must be full width: it would read the 32 bits, modify the 16 in the middle with 0xAABB, and write the 32 bits back. This is of course inefficient. Even worse would be writing 0xAABB to 0x1003. That takes two bus transactions, one for 0xBBXXXXXX at address 0x1000 and one for 0xXXXXXXAA at address 0x1004. That is a lot of extra cycles, both on the bus and for the read-modify-writes on the memory.
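Those two cases can be modelled in a few lines of C. This is only a sketch of the hypothetical 32-bit little-endian bus described here, not of any real x86 bus; BUS_BYTES, bus_transactions and first_lane_mask are names I made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BUS_BYTES 4u /* assumed bus and memory width: 32 bits */

/* Number of bus-width words that an access of `size` bytes at `addr` touches. */
unsigned bus_transactions(uint32_t addr, unsigned size)
{
    assert(size > 0);
    uint32_t first = addr / BUS_BYTES;
    uint32_t last = (addr + size - 1) / BUS_BYTES;
    return (unsigned)(last - first + 1);
}

/* Byte lane mask for the first word of the access: bit i set means the
   byte at (word base + i) is written. */
unsigned first_lane_mask(uint32_t addr, unsigned size)
{
    assert(size > 0);
    unsigned offset = addr % BUS_BYTES;
    unsigned mask = 0;
    for (unsigned i = offset; i < BUS_BYTES && i < offset + size; i++)
        mask |= 1u << i;
    return mask;
}
```

For the 16-bit write to 0x1001 this gives one transaction with mask 0b0110; for the write to 0x1003 it gives two transactions, the first with only the top lane (0b1000) enabled.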

Now, the stack alignment rules will not by themselves prevent read-modify-writes on writes. But where larger transfers happen there are opportunities for a performance gain. For example, if the bus and the memory were both 32 bits wide and you did a 64-bit transfer to address 0x1000, that can, depending on the bus design, look like a single transfer with a length of two: the bus handshake happens once, then the data moves on two back-to-back clocks, rather than one handshake for each bus width of data. So you get a gain there, and because the memory is 32 bits wide, it is two full-width writes into the SRAM in the cache with no read-modify-write. Pretty clean; you want to avoid the read-modify-writes.

Do this for a while as things evolve, and both the hardware and the tools end up wanting an aligned stack.

You are clearly asking about x86 here, but in general, depending on the instruction set, you as the programmer can sometimes choose to, say, push a byte on the stack and then adjust the stack pointer to align it. Or, if you are making room for local variables and the stack pointer is general purpose enough to do math on, you can simply subtract: sub sp,#8 moves the stack pointer exactly as pushing two 32-bit items would, and it makes room for two 32-bit items.

If the rule is, say, 32-bit alignment and you push a byte, then you need to adjust the stack pointer by 3 more bytes to make the total change in the stack pointer a multiple of 4 bytes (32 bits).

To know how much, you simply count it up. If the rule is 16-byte alignment and you push 4 bytes, then you need to push 12 more bytes or adjust the stack pointer by 12 more.
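That counting can be written as one expression. A minimal sketch in C, where stack_padding is a hypothetical name and the alignment is assumed to be a power of two, in bytes:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: extra bytes needed so that used_bytes + padding
   is a multiple of align_bytes (align_bytes must be a power of two). */
size_t stack_padding(size_t used_bytes, size_t align_bytes)
{
    assert(align_bytes != 0 && (align_bytes & (align_bytes - 1)) == 0);
    size_t mask = align_bytes - 1;
    /* Bytes past the last boundary, subtracted from one full unit;
       the final mask turns "one full unit" into zero. */
    return (align_bytes - (used_bytes & mask)) & mask;
}
```

stack_padding(1, 4) is 3 and stack_padding(4, 16) is 12, the two cases described above; when the total is already a multiple of the alignment the result is 0.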

The key here is that if everyone agrees to keep the stack aligned then you don't actually have to look at the lower bits of the stack pointer, you just keep track of what you are pushing and popping before calling something else.

If the stack is shared with the interrupt handlers (not really the case on your current x86 running an operating system, but still possible, and possible in many other use cases for general-purpose processors), I have not seen this rule apply at every instruction. You will see the compiler do a push or pop of less than the aligned size and then adjust with other pushes or pops, or with a subtraction or addition. If an interrupt occurred between those, the handler would see an unaligned stack.

Some architectures fault on unaligned accesses, which is a further reason for keeping the stack aligned.

If your code is not messing with the stack, then you don't need to mess with the stack pointer. Only if your code uses the stack by allocating space on it (pushes, including the return address that a call pushes on x86, or math on the stack pointer) do you need to care. Then you need to know the convention of the compiler whose code you are linking with, and conform to it. If this is all assembly language and no compiler, then you decide the convention yourself and can basically do whatever you want within the limitations of the processor itself.
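For the bar/foo example in the question, here is the arithmetic under one specific convention. I am assuming the System V AMD64 ABI (the question does not say which convention applies), which requires rsp to be a multiple of 16 at the point of each call. The call instruction pushes an 8-byte return address, so on entry to bar rsp is 8 bytes off a 16-byte boundary, and subtracting 8 restores the alignment before call foo. The names below are hypothetical and the code only models the pointer values:

```c
#include <assert.h>
#include <stdint.h>

#define STACK_ALIGN 16u   /* assumed ABI requirement at each call site */
#define RET_ADDR_BYTES 8u /* return address pushed by call on x86-64 */

/* Hypothetical model: rsp at entry to the callee, given rsp in the
   caller immediately before the call instruction. */
uint64_t rsp_after_call(uint64_t rsp)
{
    return rsp - RET_ADDR_BYTES;
}

/* Bytes to subtract from rsp to reach the next lower STACK_ALIGN
   boundary (the stack grows toward lower addresses). */
uint64_t realign_amount(uint64_t rsp)
{
    return rsp % STACK_ALIGN;
}
```

Starting from an aligned rsp in bar's caller, realign_amount(rsp_after_call(rsp)) is 8. Any value that is 8 more than a multiple of 16 (24, 40, ...) would also work, which is why the amount changes once bar pushes registers or reserves locals of its own.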

As for your title question, this has nothing to do with assembly at all, nor with machine code. It has to do with your code and what it does. Assembly language is simply the language in which you convey how much you want to adjust the stack pointer. The instruction doesn't know or care about any of this; it takes the constant provided and applies it to the register. Assembly is one of the few languages, if not the only one, that lets you do math on the stack pointer register, so there is that connection. But alignment and assembly are not related.