assembly x86 parameter-passing callstack calling-convention

Why are parameters arranged this way on the stack when a function is called?

I'm following an OS development tutorial. There I need to implement a function that receives the address of an I/O port (2 bytes long) and a data byte (1 byte long), and sends the data to that port.

This should be implemented in assembly (NASM) and called from C code through a function declared in a header. Here are the solutions from the tutorial:

io.s

global outb             ; make the label outb visible outside this file

; outb - send a byte to an I/O port
; stack: [esp + 8] the data byte
;        [esp + 4] the I/O port
;        [esp    ] return address
outb:
    mov al, [esp + 8]    ; move the data to be sent into the al register
    mov dx, [esp + 4]    ; move the address of the I/O port into the dx register
    out dx, al           ; send the data to the I/O port
    ret                  ; return to the calling function

io.h

#ifndef INCLUDE_IO_H
#define INCLUDE_IO_H

/** outb:
*  Sends the given data to the given I/O port. Defined in io.s
*
*  @param port The I/O port to send the data to
*  @param data The data to send to the I/O port
*/
void outb(unsigned short port, unsigned char data);

#endif /* INCLUDE_IO_H */

My question is about this part:

; stack: [esp + 8] the data byte
;        [esp + 4] the I/O port
;        [esp    ] return address

I am building for a 32-bit environment, so the 4-byte difference between the address of the return address and the address of the I/O port makes sense: the return address is 4 bytes long. But why is the difference between the addresses of the I/O port and the data byte also 4?

I thought that when I call a function in C, it pushes the arguments directly onto the stack, then pushes the return address and jumps to the function. By that reasoning, the data byte should be at [esp + 6] (4 bytes of return address + 2 bytes of I/O port) instead of [esp + 8]. It seems the parameters are also being aligned on a 4-byte boundary, but I'm not sure about this.

Is this happening because of the -m32 flag? I read about this flag in the GNU documentation, which states:

-m32
-m64
Generate code for a 32 bit or 64 bit environment. The 32 bit environment sets int, long and
pointer to 32 bits. The 64 bit environment sets int to 32 bits and long and pointer to 64
bits.

So it looks like this only changes the sizes of int / long / pointers. So why is assembly side 'sure' that parameters will be on 4 byte boundary? Is this just a convention? And if yes, why is it needed?

Here are all the flags I'm using for building:

CFLAGS = -m32 -nostdlib -nostdinc -fno-builtin -fno-stack-protector \
         -nostartfiles -nodefaultlibs -Wall -Wextra -Werror

LDFLAGS = -T link.ld -melf_i386
ASFLAGS = -f elf32

Solution

So why is assembly side 'sure' that parameters will be on 4 byte boundary? Is this just a convention?

Yes, it's a convention. What you are seeing is the IA32 cdecl calling convention, which is the default calling convention used by most compilers on IA32 (x86 32 bit).

From the GCC documentation:

cdecl

On the x86-32 targets, the cdecl attribute causes the compiler to assume that the calling function pops off the stack space used to pass arguments. This is useful to override the effects of the -mrtd switch.

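This calling convention expects parameters to be pushed onto the stack by the caller, and popped by the caller afterwards. Since the push and pop instructions default to the register size in 32-bit mode, a push on IA32 normally moves the stack pointer by 4 bytes, so each argument of 4 bytes or less occupies a full 4-byte stack slot even when the value itself is narrower. That is why the 2-byte port takes up [esp + 4] through [esp + 7] and the data byte is found at [esp + 8]. Of course, smaller values could be stored with sub esp, x + mov, resulting in a smaller stack displacement, but that is not what this convention dictates.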

Argument passing can also be done with other instructions; the calling convention doesn't care how you get the data into memory above the stack pointer before a call, it just needs to be where the callee expects to find it. Depending on optimizations, or on the -mtune= settings for old CPUs, -maccumulate-outgoing-args may be enabled, causing GCC to avoid using push.
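To make the slot arithmetic concrete, here is a minimal sketch in C that models where each argument ends up relative to esp when the callee starts executing. The names SLOT_BYTES, slot_size and cdecl_arg_offsets are hypothetical helpers invented for this illustration, and the model assumes plain scalar arguments; it ignores types with stricter alignment requirements.

```c
#include <assert.h>
#include <stddef.h>

/* Size of one stack slot under the 32-bit cdecl convention (assumed model). */
#define SLOT_BYTES 4u

/* Round an argument's size up to a whole number of 4-byte stack slots. */
static size_t slot_size(size_t arg_size)
{
    return (arg_size + SLOT_BYTES - 1u) / SLOT_BYTES * SLOT_BYTES;
}

/*
 * Hypothetical helper: fill offsets[i] with the offset from esp, at callee
 * entry, of the i-th argument, given each argument's size in bytes.
 * [esp] holds the 4-byte return address, so the first argument is at esp + 4.
 */
static void cdecl_arg_offsets(const size_t *sizes, size_t count, size_t *offsets)
{
    assert(sizes != NULL && offsets != NULL);

    size_t offset = SLOT_BYTES; /* skip the return address */
    for (size_t i = 0; i < count; i++) {
        offsets[i] = offset;
        offset += slot_size(sizes[i]); /* a narrow argument still takes a full slot */
    }
}
```

For outb(unsigned short port, unsigned char data), passing the sizes {2, 1} gives offsets 4 and 8, which matches the stack comment in io.s: the port's 2 bytes sit in a 4-byte slot, so the data byte starts at [esp + 8], not [esp + 6].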

And if yes, why is it needed?

It is not really needed, it's just the standard calling convention for IA32. You can specify a different calling convention if you want: just use __attribute__((xxx)) with one of the attributes defined in the documentation I linked above, and remember to update your assembly code according to the chosen calling convention.

Beware, though, that this approach makes your code compiler-dependent: only GNU-compatible compilers such as GCC and clang understand __attribute__, and other compilers that default to the IA32 cdecl convention might not recognize the attribute and either error out or generate incorrect code.

For example, __attribute__((regparm(3))) will get GCC to efficiently pass the first 3 arguments in registers instead of memory even in 32-bit code. The Linux kernel uses gcc -mregparm=3 for 32-bit builds: user-space enters the kernel only through system calls rather than ordinary function calls, so there's nothing stopping the kernel from using a different calling convention internally than user-space.
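As a rough sketch of how such an attribute might be applied while keeping the code buildable elsewhere, the snippet below wraps it in a macro. IO_REGPARM and pack_port_data are hypothetical names made up for this example, not part of the tutorial. If the same attribute were put on outb, the assembly in io.s would have to read its arguments from registers instead of from [esp + 4] and [esp + 8].

```c
/*
 * IO_REGPARM is a hypothetical macro name. regparm only applies to x86-32
 * targets and GNU-compatible compilers, so on anything else it expands to
 * nothing and the default calling convention is used.
 */
#if defined(__GNUC__) && defined(__i386__)
#define IO_REGPARM __attribute__((regparm(3)))
#else
#define IO_REGPARM
#endif

/*
 * Hypothetical function, for illustration only. With regparm(3) in effect,
 * the first integer arguments are passed in EAX, EDX and ECX (in that order),
 * so port arrives in EAX and data in EDX rather than on the stack.
 */
IO_REGPARM unsigned int pack_port_data(unsigned short port, unsigned char data);

IO_REGPARM unsigned int pack_port_data(unsigned short port, unsigned char data)
{
    /* Combine both arguments: port in the upper bits, data in the low byte. */
    return ((unsigned int)port << 8) | data;
}
```

The C source looks the same to callers either way; only the generated call sequence changes, which is why any hand-written assembly on the other side of the call has to be updated to match.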