I have written an x86 FPU assembly routine that normalizes a vector of three floats. Here it is:
_asm_normalize10:; Function begin
push ebp ; 002E _ 55
mov ebp, esp ; 002F _ 89. E5
mov eax, dword [ebp+8H] ; 0031 _ 8B. 45, 08
fld dword [eax] ; 0034 _ D9. 00
fmul st0, st(0) ; 0036 _ DC. C8
fld dword [eax+4H] ; 0038 _ D9. 40, 04
fmul st0, st(0) ; 003B _ DC. C8
fld dword [eax+8H] ; 003D _ D9. 40, 08
fmul st0, st(0) ; 0040 _ DC. C8
faddp st1, st(0) ; 0042 _ DE. C1
faddp st1, st(0) ; 0044 _ DE. C1
fsqrt ; 0046 _ D9. FA
fld1 ; 0048 _ D9. E8
fdivrp st1, st(0) ; 004A _ DE. F1
fld dword [eax] ; 004C _ D9. 00
fmul st(0), st1 ; 004E _ D8. C9
fstp dword [eax] ; 0050 _ D9. 18
fld dword [eax+4H] ; 0052 _ D9. 40, 04
fmul st(0), st1 ; 0055 _ D8. C9
fstp dword [eax+4H] ; 0057 _ D9. 58, 04
fld dword [eax+8H] ; 005A _ D9. 40, 08
fmulp st1, st(0) ; 005D _ DE. C9
fstp dword [eax+8H] ; 005F _ D9. 58, 08
pop ebp ; 0062 _ 5D
ret ; 0063 _ C3
; _asm_normalize10 End of function
(It is my own code ;-) It works, and I have tested it.)
I don't know x86 assembly very well, and I would like to find some ways to optimize the code above. I want to stay with plain old x87 FPU code, without SSE, but make it somewhat faster than it is now.
In particular, I wonder whether there is anything clumsy in it: I load x, y, z onto the FPU stack, compute 1/sqrt(x*x+y*y+z*z), then load x, y, z from RAM again, multiply each by that value, and store the results.
Is that suboptimal? Should I instead load x, y, z only once (not twice), keep them on the FPU stack, do the computation, and store them at the end?
You can do exactly what you suggested and load x, y and z only once. It seems like something that could (and should) help. Apart from that, I don't see much room for further improvement, assuming you still don't want to use the approximate inverse square root trick.
Not tested:
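; load everything (assumes eax already points to the vector, as in your code)
; afterwards: st0 = z, st1 = y, st2 = x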
fld dword [eax]
fld dword [eax+4]
fld dword [eax+8]
; square and add
fld st(2)
fmul st(0), st(0)
; (see diagram 1 for fpu stack)
fld st(2)
fmul st(0), st(0)
; (see diagram 2 for fpu stack)
faddp st(1), st(0)
; (see diagram 3 for fpu stack)
fld st(1)
fmul st(0), st(0)
faddp st(1), st(0)
; (see diagram 4 for fpu stack)
; calculate inverse sqrt
fsqrt
fld1
fdivrp st(1), st(0)
; scale
fmul st(1), st(0)
fmul st(2), st(0)
fmulp st(3), st(0)
; store
fstp dword [eax+8]
fstp dword [eax+4]
fstp dword [eax]
Diagram 1:
st3: x
st2: y
st1: z
st0: x * x
Diagram 2:
st4: x
st3: y
st2: z
st1: x * x
st0: y * y
Diagram 3:
st3: x
st2: y
st1: z
st0: x * x + y * y
Diagram 4:
st3: x
st2: y
st1: z
st0: x * x + y * y + z * z