Search code examples
c++assemblyx86fpux87

Fault in calculation with small numbers in fpu?


Hi, i am trying to do this formula in fpu.

(y = v*t*sin(a) - 0.5*g*t^2) 

My code in c++ is:

typedef void(*Ipa_algorithm2)(double t, double alfa, double *return_value);
Ipa_algorithm2 count_y;
count_y = (Ipa_algorithm2)GetProcAddress(hInstLibrary, "ipa_algorithm");
t = t + 0.01; //going from cas = 0
(*count_y)(t,camera.angleY, &y); //t = cas; 

and my code in asm is:

section .data
help_var dq 0 
speed dq 40.0 ;v = rychlost
number dq 180.0
grav dq 4.906865 ;grav= 0,5*g

ipa_algorithm2:
push ebp
mov ebp, esp
finit
fld qword [speed]
fld qword [ebp+8]
fmul st1
fstp qword [help_var] ;v pomocny je v*t
fldpi               
fld qword [ebp+16]  ;na st0 je uhel a na st1 3,14
fmul st1 ;na st0 je uhel * 3,14
fld qword [number]
fxch st1 ;na st0 je uhel*3,14 na st1 je 180
fdiv st1 ;na st0 je uhel v radianech
fsin
fld qword [help_var]
fmul st1 ;na st0 je v*t*sin uhlu
fst qword [help_var]
finit
fld qword [ebp+8]
fld qword [ebp+8]
fmul st1
fld qword [grav]
fmul st1
fld qword [help_var]
fxch st1
fsub st1


mov eax,[ebp+24]
fstp qword [eax]    

mov esp, ebp
pop ebp
ret 0

The problem is, that function ipa_algorithm2 is giving me correct numbers from the start (compared with output from program doing the same in C), but after severals steps the results start to be worse and worse. I was checking the code for 3 hours and I didn't find any mistake. Is it possible that the numbers I am counting with are so small that fpu can't count with them?


Solution

  • Update: according to a comment, you're getting wrong numbers for a whole range of inputs, so presumably you just have a regular bug in implementing the formula, not an FP-specific rounding-error or numerical accuracy/stability type of problem. Single-step your function in a debugger for an input that gives a wrong answer, and look at the register values.

    Or better, rewrite this with scalar AVX instructions, because scalar AVX is easier than x87, and you eventually want a vectorized AVX implementation anyway, so a working scalar implementation is a better starting point. For sin(), call a vectorized sin() implementation, or let gcc auto-vectorize your function with -O3 -ffast-math. (See https://sourceware.org/glibc/wiki/libmvec: glibc has vectorized math library functions.)

    Starting with a scalar x87 implementation using the slow fsin instruction is probably the least useful starting point if you eventually want something that runs fast. Good clean C would be better than a sloppy asm implementation for an instruction set you're not even going to use. (And for the final optimized version, C with intrinsics would make more sense than hand-written asm in most cases). See http://agner.org/optimize/, and other links in the x86 tag wiki.


    Storing angles in games

    Store directions as [x,y] vectors, not angles in radian. (Or degrees). With a normalized xy vector, adding two angles becomes a 2x2 matrix multiply (by a rotation matrix). But sin becomes trivial: if you keep your vector normalized (x^2 + y^2 = 1.0) then sin(angle) = angle.y.

    Avoid using actual angles whenever possible and use normalized vectors instead. You sometimes need atan2, but usually rarely enough that you can just use the plain library version.

    If you store your xy pairs in a struct-of-arrays format, it will be SIMD friendly, and you can easily do stuff with 8 float x values and 8 float matching y values. Doing stuff with a direction vector packed into a single SIMD vector is usually NOT optimal; don't be fooled by the word "vector".

    See also https://stackoverflow.com/tags/sse/info and especially SIMD at Insomniac Games (GDC 2015). This will help you understand how to design your program so you can later optimize it with SIMD in the places where that's worthwhile. (You don't have to vectorize everything at the start, but changing your data layout is often a lot of work so consider making your data SIMD friendly in the first place.)


    Possible sources of numerical error (turns out not to be the real problem here)

    One possible reason: The worst-case error for the fsin instruction for small inputs is actually about 1.37 quintillion units in the last place, leaving fewer than four bits correct.. Most modern math libraries don't use the fsin instruction to compute the sin function, because it's not fast and has poor accuracy for some inputs.

    Also, depending how you built your code, something (like the MSVCRT startup, if you're on Windows and using an old version of it) may have set the x87 FPU to less than 80-bit precision (64-bit mantissa).


    Why are you writing this in asm? Do you want advice on how to make it more efficient? You should return a float in st0 as the return value, instead of storing through a pointer arg. Also, don't use finit. I think you're only doing that because you don't balance the x87 stack with pops after you load stuff, so after repeated calls you'd otherwise get NaNs from x87 stack overflow. You're still returning with the x87 stack non-empty in a function that returns void, so you're still doing it wrong and could break the caller.

    USe fstp or fmulp to leave the stack balanced. Use fld st0 instead of another load. Use fmul qword [grav_zrychleni] instead of a separate fld.

    Or better, use SSE2 or AVX for scalar double precision math. Unless you really want 80-bit long double.