asm (fpu) normalize - how to optimize it

I have written an x86 FPU assembly routine to normalize a vector of three floats. Here it is:

    _asm_normalize10:; Function begin
    push    ebp                                     ; 002E _ 55
    mov     ebp, esp                                ; 002F _ 89. E5
    mov     eax, dword [ebp+8H]                     ; 0031 _ 8B. 45, 08
    fld     dword [eax]                             ; 0034 _ D9. 00
    fmul    st0, st(0)                              ; 0036 _ DC. C8
    fld     dword [eax+4H]                          ; 0038 _ D9. 40, 04
    fmul    st0, st(0)                              ; 003B _ DC. C8
    fld     dword [eax+8H]                          ; 003D _ D9. 40, 08
    fmul    st0, st(0)                              ; 0040 _ DC. C8
    faddp   st1, st(0)                              ; 0042 _ DE. C1
    faddp   st1, st(0)                              ; 0044 _ DE. C1
    fsqrt                                           ; 0046 _ D9. FA
    fld1                                            ; 0048 _ D9. E8
    fdivrp  st1, st(0)                              ; 004A _ DE. F1
    fld     dword [eax]                             ; 004C _ D9. 00
    fmul    st(0), st1                              ; 004E _ D8. C9
    fstp    dword [eax]                             ; 0050 _ D9. 18
    fld     dword [eax+4H]                          ; 0052 _ D9. 40, 04
    fmul    st(0), st1                              ; 0055 _ D8. C9
    fstp    dword [eax+4H]                          ; 0057 _ D9. 58, 04
    fld     dword [eax+8H]                          ; 005A _ D9. 40, 08
    fmulp   st1, st(0)                              ; 005D _ DE. C9
    fstp    dword [eax+8H]                          ; 005F _ D9. 58, 08
    pop     ebp                                     ; 0062 _ 5D
    ret                                             ; 0063 _ C3
    ; _asm_normalize10 End of function

[It is my own code ;-) It works and I have tested it.]

I do not know x86 assembly very well, and I would like to optimize the routine above (plain old x87 FPU assembly, specifically without SSE, but somewhat faster than what I have now).

In particular, I wonder whether there is something clumsy in the code above: I load x, y and z onto the FPU stack, compute 1/sqrt(x*x+y*y+z*z), then load x, y and z from RAM again, multiply each by that value, and store the results.

Is this suboptimal? Should I instead load x, y and z only once (not twice), keep them on the FPU stack during the computation, and store them at the end?


  • You can do exactly what you suggested and load x, y and z only once. It seems like something that could/should help. Apart from that, I don't see much opportunity for anything, assuming you still don't want to use the approximate inverse square root trick.

    Not tested:

    ; load everything
    fld dword [eax]
    fld dword [eax+4]
    fld dword [eax+8]
    ; square and add
    fld st(2)
    fmul st(0), st(0)
                      ; (see diagram 1 for fpu stack)
    fld st(2)
    fmul st(0), st(0)
                      ; (see diagram 2 for fpu stack)
    faddp st(1), st(0)
                      ; (see diagram 3 for fpu stack)
    fld st(1)
    fmul st(0), st(0)
    faddp st(1), st(0)
                      ; (see diagram 4 for fpu stack)
    ; calculate inverse sqrt: st0 = 1 / sqrt(x*x + y*y + z*z)
    fsqrt
    fld1
    fdivrp st(1), st(0)
    ; scale
    fmul st(1), st(0)
    fmul st(2), st(0)
    fmulp st(3), st(0)
    ; store
    fstp dword [eax+8]
    fstp dword [eax+4]
    fstp dword [eax]

    Diagram 1:

    st3: x
    st2: y
    st1: z
    st0: x * x

    Diagram 2:

    st4: x
    st3: y
    st2: z
    st1: x * x
    st0: y * y

    Diagram 3:

    st3: x
    st2: y
    st1: z
    st0: x * x + y * y

    Diagram 4:

    st3: x
    st2: y
    st1: z
    st0: x * x + y * y + z * z
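
    Since the routine above is untested, it may be worth checking its output from C before timing it. The helper below is only a sketch: `looks_normalized` and its `tol` parameter are hypothetical names, not part of the question or of any library. It checks that the result has unit length and points in the same direction as the input, and it avoids `sqrt` so it does not need the math library.

    ```c
    #include <stddef.h>

    /* Hypothetical test helper (an assumption, not from the original post).
       Returns 1 if `out` looks like `in` scaled to unit length, else 0. */
    int looks_normalized(const float in[3], const float out[3], double tol)
    {
        double ix = in[0],  iy = in[1],  iz = in[2];
        double ox = out[0], oy = out[1], oz = out[2];

        /* 1. The squared length of the result should be 1 within tol. */
        double err = (ox * ox + oy * oy + oz * oz) - 1.0;
        if (err < 0.0)
            err = -err;
        if (err > tol)
            return 0;

        /* 2. The result must not be flipped: dot(in, out) > 0. */
        if (ix * ox + iy * oy + iz * oz <= 0.0)
            return 0;

        /* 3. The result must be parallel to the input: cross(in, out)
              should vanish. Because |out| is about 1, compare
              |cross|^2 against tol^2 * |in|^2. */
        double cx = iy * oz - iz * oy;
        double cy = iz * ox - ix * oz;
        double cz = ix * oy - iy * ox;
        double cross2 = cx * cx + cy * cy + cz * cz;
        double in2 = ix * ix + iy * iy + iz * iz;

        return cross2 <= tol * tol * in2;
    }
    ```

    To use it, keep a copy of the input vector, call the assembly routine on the original, and pass both to the helper with a tolerance of around 1e-6 for single-precision results. A zero vector is not handled specially here, just as the assembly does not guard against dividing by zero.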