c++assembly inline-assembly cartesian x87

How to convert old x87 assembly code to extended asm (with "=u" and "=t" constraints) to convert spherical coordinates

I have this old code to transform spherical to Cartesian 3D coordinates :

TDVector3D Cartesian3D_asm(const double &Theta, const double &Phi)
{
  TDVector3D V;
  __asm__
  {
    mov    eax,[ebp+0x0C]
    mov    edx,[ebp+0x10]
    fld     qword ptr [eax]  // ST0=T     Theta
    fsincos                  // ST1=sin(T)  ST0=cos(T)
    fxch    ST(1)            // ST1=cos(T)  ST0=sin(T)
    fld     qword ptr [edx]  // ST2=cos(T)  ST1=sin(T)  ST0=P   Phi
    fsincos                  // ST3=cos(T)  ST2=sin(T)  ST1=sin(P) ST0=cos(P)
    fmul    ST,ST(2)         // ST3=cos(T)  ST2=sin(T)  ST1=sin(P) ST0=cos(P)*sin(T)
    fstp    qword ptr V.X    // ST2=cos(T)  ST1=sin(T)  ST0=sin(P)

    fmulp   ST(1),ST         // ST1=cos(T)  ST0=sin(P)*sin(T)
    fstp    qword ptr V.Y    // ST0=cos(T)

    fstp    qword ptr V.Z    // Coprocesseur vide
    fwait
  }
  return V;
}

with this TDVector3D struct :

typedef struct TDVector3D {
        double X, Y, Z;
        TDVector3D(double x, double y, double z): X(x), Y(y), Z(z) { }
} TDVector3D;

The no assembler code is :

TDVector3D Cartesian3D(const double &Theta, const double &Phi)
{
  double X, Y, Z;
  X = Y = sin(Theta);
  X *= cos(Phi);
  Y *= sin(Phi);
  Z = cos(Theta);
  return TDVector3D(X, Y, Z);
}

I found this sample for SinCos :

void SinCos(double Theta, double *sinT, double *cosT)
{
    __asm__ ("fsincos" : "=t" (*cosT), "=u" (*sinT) : "0" (Theta));
}

I try to convert my old code but I am totally lost with "u", "0", "t" (who is who regarding ST0, ST1, ...).

Solution

https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html - "=t" means that output picks the Top of Stack register, st(0). "0" picks the same location as operand 0, a "matching constraint", so also st(0), which makes sense because that's the input for fsincos. "=u" is unsurprisingly st(1), the other output.

IDK why you want to use inline asm at all, though, instead of just letting the compiler optimize sincos() (a GNU extension) from <math.h> / <cmath>; a math library function call is fine, and maybe faster than the x87 instruction. Might even auto-vectorize if Cartesian3D is called in a loop, producing 2 or 4 results with about the same amount of work as one.

Also, why take a double by const reference? It's small enough to pass by value. BTW, the main reason to use x87 in modern code would be for 80-bit extended precision. If you need that, still just use sincosl. GCC might inline it to a fsincos instruction, or might call a library function which might still be faster; https://agner.org/optimize times fsincos at 60-120 uops, with 60-140 cycle latency (strangely no throughput reported.)

Also, if you do insist on using asm to force it to run an x87 fsincos, you don't need to convert it to one bit asm() statement, you can just call that working SinCos() twice, for two separate inputs, and the compiler will take care of loads / stores and fxchg. Well, I thought you wouldn't need to, but GCC and clang do a rather poor job in 64-bit mode, wanting to bounce data back to XMM registers for the multiplies. https://godbolt.org/z/1j9df4cro whether you pass by value (forcing it to spill XMM registers to memory first if it doesn't inline) or by reference.

Even in 32-bit mode it was auto-vectorizing the multiply I think, wanting to use mulpd. https://godbolt.org/z/sTd43zono shows GCC -m32 -O3 -mfpmath=387 -fno-tree-vectorize making efficient asm using your wrapper function, with about the same number of instructions as your inline asm{} block.

static inline void SinCos_x87(double Theta, double *sinT, double *cosT)
{
    __asm__ ("fsincos" : "=t" (*cosT), "=u" (*sinT) : "0" (Theta));
}

#if 1
 #define SINCOS SinCos_x87
#else
 #include <math.h>
 #define SINCOS sincos
#endif

TDVector3D Cartesian3D_sincos(const double Theta, const double Phi)
{
    double sinTheta, cosTheta, sinPhi, cosPhi;
    SINCOS(Theta, &sinTheta, &cosTheta);
    SINCOS(Phi, &sinPhi, &cosPhi);
    double X = sinTheta * cosPhi;
    double Y = sinTheta * sinPhi;
    double Z = cosTheta;
    return TDVector3D(X, Y, Z);
}

Unlike MSVC's inefficient style of asm{} block, GNU C inline asm can efficiently wrap a single instruction, and tell the compiler what registers to place the inputs and find the outputs. There is no overhead, so as long as the compiler can generate efficient code between the inlined asm("":::) statements, there's no benefit to having one more complicated asm statement.

That is the case for 32-bit mode without auto-vectorization, but not for 64-bit mode (unless you also use -mfpmath=387 for that compilation unit! Godbolt)