x86 Assembly How to properly get XMM0 into ST0?

A wonderful Sunday everyone.

I am currently learning a lot of assembly in the 32-bit environment (currently Windows). I am using FASM for this.

I have the following code which I successfully made but I'm quite unhappy with the way I load XMM0 into ST0:

GetDistance: ;(__cdecl*)(float x1, float y1, float x2, float y2)
        push    ebp
        mov     ebp, esp        
        sub     esp, 0x4
        
        movss   xmm0, DWORD [ebp + 0x0014] ; Load x2
        subss   xmm0, DWORD [ebp + 0x000C] ; Subtract x1

        movss   xmm1, DWORD [ebp + 0x0010] ; Load y2
        subss   xmm1, DWORD [ebp + 0x0008]  ; Subtract y1

        mulss   xmm0, xmm0             ; Square of the x difference
        mulss   xmm1, xmm1             ; Square of the y difference

        addss   xmm0, xmm1             ; Sum of squared differences

        sqrtss  xmm0, xmm0             ; Square root
                                 
        movss   dword [ebp - 0x0004], xmm0
        fld     dword [ebp - 0x0004]    
        
        add     esp, 0x4
        
        pop     ebp
        ret     0

It does work but I have been googling for a straight 2 hours now (even asked ChatGPT) on how to get my XMM0 value into ST0 but I fail to search for the correct problem I guess and ChatGPT's answers always created compile errors or made my function return 'NAN'. ChatGPT converted my simple function always to an executable main block which uses .data section and therefore global variables and I think it leads me into a complete wrong direction.

I don't like that I had to use sub from and add to ESP to get XMM0 into ST0.

I also appreciate any tips to improve my code or even good resources to learn from it. I only want to focus 32-bit for now. :)

Solution

Store/reload is necessary to transfer from XMM to st0. Even though MMX registers alias the x87 registers, there's no way to use MOVDQ2Q mm0, xmm0 to get an 80-bit FP bit-pattern into st0, even apart from the problem of switching back from MMX to x87 state without clearing the registers.

You don't need to waste instructions setting up EBP as a frame pointer, though, especially in simple functions like this where it's easy enough to keep track of offsets relative to ESP.

In a function with stack args, the callee (your function) "owns" them, so you can use [esp+4] as scratch space instead of reserving new space. This is why, when calling the same function twice with the same args, the caller has to store the args again. e.g.

square:                   ; float square(float a); legacy cdecl convention
 movss  xmm0, [esp+4]
 mulss  xmm0, xmm0
 movss  [esp+4], xmm0      ; reuse the incoming arg as scratch space
 fld    dword [esp+4]
 ret

In this case it would have been more efficient to use fld dword [esp+4] / fmul st0 / ret because we're using a calling convention that returns in st0.

If you insist on using 32-bit code, then the default calling-conventions are old and bad, passing args on the stack and returning float/double in st0 instead of xmm0.

For Windows there are less bad 32-bit calling conventions, though. 32-bit vectorcall passes the first 6 FP (or SIMD vector) args in xmm registers, and returns in xmm0. And the first 2 integer args in regs like fastcall. (64-bit vectorcall only passes 4 args in XMM regs, differing from the standard Windows x64 convention only in handling types like __m128i and __m256.) See https://learn.microsoft.com/en-us/cpp/cpp/vectorcall?view=msvc-170 for more.

float _vectorcall 
 foo(float a, float b, float c, float d, float e, float f, float g, int i){
    return a+b+c+d+e+f+g + i;
}

Compiles with x86 MSVC 19.10 (Godbolt). It's a callee-pops convention like fastcall; note the ret 4 since we have one stack arg. If you don't have any stack args, though, just a normal ret is still correct.

_g$ = 8                                       ; size = 4
float foo(float,float,float,float,float,float,float,int) PROC                                ; foo, COMDAT
        addss   xmm0, xmm1
        movd    xmm1, ecx
        cvtdq2ps xmm1, xmm1          ; avoids a false dependency vs. cvtsi2ss xmm1, ecx which is also 2 uops
        addss   xmm0, xmm2
        addss   xmm0, xmm3
        addss   xmm0, xmm4
        addss   xmm0, xmm5
        addss   xmm0, DWORD PTR _g$[esp-4]   ; 7th FP arg comes from the stack.
                                             ; with _g$ = 8, this is actually [esp+4]
        addss   xmm0, xmm1                   ; +i  converted earlier
        ret     4
float foo(float,float,float,float,float,float,float,int) ENDP                                ; foo

If your callers are also hand-written asm, then you don't have to follow a standard calling convention; you can pass/return args in convenient registers and document it with comments on a per-function basis.

ChatGPT's answers always created compile errors or made my function return 'NAN'. ChatGPT converted my simple function always to an executable main block which uses .data section and therefore global variables and I think it leads me into a complete wrong direction.

Unsurprising; ChatGPT is very bad at assembly language, buggy code is normal.
It doesn't "understand" what it's doing in any language, but x86 asm was probably rarer in its training data and/or harder for large language models because the same register names and mnemonics get used in all programs. And there are so many different flavours of assembly language (including multiple for x86) that probably doesn't help.