Tags: c++, gcc, glibc, sse, intrinsics

Output errors when using libmvec intrinsics for trigo functions manually (like cosf)


Referencing this link, I tried to integrate the libmvec intrinsics into some existing C++ code.

Naturally, this involved changing the forward declaration to

extern "C" __m128 _ZGVbN4v_cosf(const __m128&);

where I added extern "C" instead of just extern. However, I also changed the input signature to const __m128&, i.e. a reference, since that is how I would normally write input parameters in C++.

This compiles, but produces incorrect (random/undefined) output.

If I instead remove the reference,

extern "C" __m128 _ZGVbN4v_cosf(const __m128);

then this compiles and also produces the intended correct output.

My first question is: is this intended behaviour? Should GCC be throwing a compile error instead if the input argument doesn't match the intrinsic's actual behaviour? Is this due to the fact that the linker cannot see the actual symbol (I am linking via -lm) until runtime? What is actually happening when it compiles and then outputs incorrect values (or is it just undefined behaviour)?

My second question is: since it isn't a reference, is this operation costly? Is the compiler copying my __m128 SIMD vectors every time it calls this? My godbolt test seems to suggest otherwise:

#include <x86intrin.h>

extern "C" __m128 _ZGVbN4v_cosf(const __m128);
extern "C" __m128 _ZGVbN4v_sinf(const __m128&);

__m128 simdsinf(const __m128& x){
    return _ZGVbN4v_sinf(x);
}
__m128 simdcosf(const __m128 x){
    return _ZGVbN4v_cosf(x);
}
simdsinf(float __vector(4) const&):
        jmp     _ZGVbN4v_sinf
simdcosf(float __vector(4)):
        jmp     _ZGVbN4v_cosf

But I'm not sure if this is representative?


Solution

  • Pass by value is cheap for things (including SIMD vectors) that fit in a single register (XMM0 in this case).
    And you always have to declare library functions in a way that matches what their machine code is expecting, otherwise the compiler will put args somewhere else or in the wrong format or whatever.

    If it was cheaper or better in general for the functions to take their args by reference, their C declarations would be __m128 foo(const __m128 *p). In asm terms, that's the same as a C++ reference except that a null pointer is allowed. (By the language; it's still fine for functions to require non-null pointers.)

    Think about __m128 working essentially like an int or float, that the compiler will keep it in a register whenever that's possible and good, and that it's cheap to pass/return by value in a single register.
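    As a minimal sketch of that (with a hypothetical add_one helper, not part of any library), this is how a __m128 travels by value, just like a float would:

    ```cpp
    #include <x86intrin.h>

    // Hypothetical helper: takes and returns __m128 by value. Per the
    // x86-64 System V calling convention the argument arrives in XMM0 and
    // the result is returned in XMM0; when it inlines, there is no memory
    // traffic at all. Non-inlined, its body is a single addps.
    static inline __m128 add_one(__m128 v) {
        return _mm_add_ps(v, _mm_set1_ps(1.0f));
    }
    ```

    For example, add_one(_mm_setr_ps(0, 1, 2, 3)) yields {1, 2, 3, 4}, with the vector never touching memory on its way in or out.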


    Should GCC be throwing a compile error instead if the input argument doesn't match the intrinsic's actual behaviour?

    It's not an intrinsic in the sense of compiling to a single machine instruction or in terms of the compiler understanding it as a builtin and being able to constant-propagate through it the way it could for _mm_add_ps or the + operator.

    It's just a library function with a name that uses the same style as intrinsics. If you lie to the compiler about the function signature, it's not going to know; it's just going to put a pointer in RDI like you asked it to, while the callee is going to look for a value in XMM0. (Assuming the x86-64 System V calling convention.)

    The compiler might recognize the name somehow, or even be able to generate calls to it when vectorizing cosf; if so you might get a warning about the declaration not matching, like what happens if you declare memcpy or printf or whatever yourself in a non-standard way. (Those functions are by default defined by GCC as builtins, which is how it's able to inline memcpy, e.g. for 4-byte copy to an int being just a load or store, or optimizing away. Use -fno-builtin-memcpy to disable that.)

    But since you say you didn't get even a warning (hopefully with -Wall), I guess it doesn't try to check those function names against known prototypes. That's why you have to declare them yourself with this hack to get access to them under that implementation-detail C name, instead of them being predefined with some clean name.
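    For comparison, here is what that builtin handling buys for memcpy: a fixed-size 4-byte copy compiles to a single 32-bit load, with no call into libc. (A sketch; load_le32 is just an illustrative name, and the expected result assumes a little-endian target such as x86-64.)

    ```cpp
    #include <cstring>
    #include <cstdint>

    // Illustrative helper: GCC's builtin memcpy turns this 4-byte copy
    // into a single 32-bit load (e.g. `mov eax, [rdi]`), not a libc call.
    // Compile with -fno-builtin-memcpy to get a real call instead.
    static inline std::uint32_t load_le32(const unsigned char *p) {
        std::uint32_t x;
        std::memcpy(&x, p, sizeof(x));
        return x;
    }
    ```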


    My second question is: since it isn't a reference, is this operation costly?

    No, a movaps xmm0, [mem] load of a vector into a register is very cheap, with a throughput of 2 or 3 loads per clock cycle, same as integer loads. A cache miss can make it expensive, same as with an integer load.

    If you had a vector in an XMM register already as the result of a computation, the call site would have to store it to memory and lea rdi, [rsp+16] or something before a call. Then the callee would have to reload it before it could do anything with it.

    Or to look at it another way: either the caller or the callee has to load the data into a register. If it's the callee, then the caller also has to do extra work to pass a pointer. (In cases where the vector data was already in memory.)
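    A sketch of the two signatures side by side (hypothetical twice_* wrappers; the noinline attribute is GCC/Clang syntax, used here just so real calls are generated):

    ```cpp
    #include <x86intrin.h>

    // By value: per x86-64 System V the vector arrives in XMM0 and the
    // callee can use it immediately.
    __attribute__((noinline)) __m128 twice_byval(__m128 v) {
        return _mm_add_ps(v, v);
    }

    // By reference: the caller must have the vector in memory (storing it
    // first if it was just computed in a register) and passes a pointer in
    // RDI; the callee then has to reload it with movaps before using it.
    __attribute__((noinline)) __m128 twice_byref(const __m128 &v) {
        return _mm_add_ps(v, v);
    }
    ```

    Both return identical results; the difference only shows up in the asm at the call site and in the callee's prologue.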

    I don't know why people think pass by reference is a good idea for SIMD vectors. Maybe from Windows / MSVC without vectorcall, where the calling convention actually passes by reference anyway even if you pass by value? Or because passing a __m128 (typedef float __attribute__((vector_size(16),may_alias)) in GCC) by reference doesn't change the ABI when SSE is disabled in 32-bit code? Or wait, 32-bit code would always pass on the stack, except maybe with GCC regparm. Maybe from 64-bit code with AVX where __m256 passes in a YMM with AVX enabled, but otherwise can't. But code that passes around Intel intrinsic vector types at all, even by reference, is probably not going to be useful when the corresponding vector extension isn't enabled. (Unlike GNU C native vectors, which can compile correctly without HW vectors of the same width, but often not efficiently.)

    Neither reference nor value can avoid the major problem with the x86-64 System V calling convention: there are no call-preserved vector registers. So a loop that works with some temporary values and a bunch of vector constants can't keep any of it in registers across a call to a vector math function. e.g. if you need both the logf and sinf of the same vector, it needs to be spilled (stored to stack space) before logf. Then spill that return value and reload the original vector as an arg for sinf, then finally reload the log result to get both results in registers at once.
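    A sketch of that spill pattern, with hypothetical log4f/sin4f stand-ins for the libmvec functions (scalar loops, marked noinline only to force opaque calls):

    ```cpp
    #include <x86intrin.h>
    #include <cmath>

    // Hypothetical stand-ins for _ZGVbN4v_logf / _ZGVbN4v_sinf: scalar
    // loops, noinline so the caller below must treat them as real calls.
    __attribute__((noinline)) __m128 log4f(__m128 v) {
        alignas(16) float f[4];
        _mm_store_ps(f, v);
        for (float &x : f) x = std::log(x);
        return _mm_load_ps(f);
    }
    __attribute__((noinline)) __m128 sin4f(__m128 v) {
        alignas(16) float f[4];
        _mm_store_ps(f, v);
        for (float &x : f) x = std::sin(x);
        return _mm_load_ps(f);
    }

    // Needs both results of the same input. With no call-preserved XMM
    // registers, the compiler must spill `v` to the stack before the
    // first call and reload it (and spill the first result) around the
    // second: look for movaps stores/loads bracketing the calls.
    __m128 log_plus_sin(__m128 v) {
        return _mm_add_ps(log4f(v), sin4f(v));
    }
    ```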

    That's one case where Windows x64 actually shines: with all of xmm6-15 call-preserved (but not the upper parts of those YMM/ZMM registers), it can keep vector constants and temporaries in XMM registers, with the vector math library function hopefully only using XMM0-5 for scratch space. (And not actually needing to spill/reload many of XMM6-15, otherwise that's no better than having the caller do it.) Otherwise, for SIMD loops in leaf functions which benefit from a lot of regs, Windows x64 requires functions to save/restore lots of them.

    [non-inlined wrapper functions that are just tailcall jmps]
    But I'm not sure if this is representative?

    Not really; those are just tailcalls that pass their args on to another function wherever they already are. It's more interesting to look at how your wrappers inline into real call sites inside loops.