Tags: c, math, floating-accuracy, libc, libm

How to compare two Math library implementations?


As you know, the C standard library defines a number of standard function calls that must be provided by any conforming implementation, e.g., Newlib, musl, glibc, and so on.

If I am targeting Linux, for example, I have to choose between glibc and musl, and my criterion is the accuracy of the math library, libm. How can I compare two possible implementations of, say, sin() or cos()?

A naive approach would be to compare the outputs of both implementations against a reference (from Matlab, for example) on a set of randomly generated inputs, but is there a more reliable/formal/structured/guided way to compare or model the two? I tried to find research in this direction but found none, so any pointers are appreciated.
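To make the naive approach concrete, here is a minimal sketch in C of what I have in mind, using MPFR (the GMP-based multiple-precision library) as the reference instead of Matlab; the function tested (sinf), the sample count, and the seed are arbitrary choices of mine. The idea would be to build and run it once against each candidate libm (e.g., once with glibc, once with musl-gcc) and compare the reports:

```c
/* Naive random test: compare the linked libm's sinf against an MPFR
 * reference on random inputs. Run once per candidate library. */
#include <math.h>
#include <mpfr.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    mpfr_t y;
    mpfr_init2(y, 256);                     /* far more precision than needed */
    srand(42);
    unsigned long not_nearest = 0, tested = 0;
    for (int i = 0; i < 1000000; i++) {
        /* crude sample over bit patterns; rand() portability caveats apply */
        uint32_t bits = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        float x;
        memcpy(&x, &bits, sizeof x);
        if (!isfinite(x))
            continue;                       /* skip NaNs and infinities */
        mpfr_set_flt(y, x, MPFR_RNDN);      /* exact: x is representable */
        mpfr_sin(y, y, MPFR_RNDN);
        float ref = mpfr_get_flt(y, MPFR_RNDN);  /* correctly rounded sin(x) */
        tested++;
        if (sinf(x) != ref)
            not_nearest++;
    }
    printf("%lu of %lu results not correctly rounded\n", not_nearest, tested);
    mpfr_clear(y);
    return 0;
}
```

Compile with something like `gcc test.c -lmpfr -lgmp -lm`. But as said, this scattershot testing feels unsatisfying, which is the motivation for the question.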


Solution

  • Some thoughts:

    • You can use the GNU Multiple Precision Arithmetic Library (GMP), or its floating-point companion MPFR, to generate good reference results.
    • It is possible to test most, if not all, of the single-argument single-precision (IEEE-754 binary32) routines exhaustively. (For some of the macOS trigonometric routines, such as sinf, we tested one implementation exhaustively, verifying that it returned faithfully rounded results, meaning the result was the mathematical value [if representable] or one of the two adjacent values [if not]. Then, when changing implementations, we compared one to the other. If a new-implementation result was identical to the old-implementation result, it passed. Otherwise, GMP was used to test it. Since new implementations largely coincided with old implementations, this resulted in few invocations of GMP, so we were able to exhaustively test a new routine implementation in about three minutes, if I recall correctly.) A sketch of such an exhaustive test appears after this list.
    • It is not feasible to test the multiple-argument or double-precision routines exhaustively.
    • When comparing implementations, you have to choose a metric, or several metrics. A library that has a good worst-case error is good for proofs; its bound can be asserted to hold true for any argument, and that can be used to derive further bounds in subsequent computations. But a library that has a good average error may tend to produce better results for, say, physics simulations that use large arrays of data. For some applications, only the errors in a “normal” domain may be relevant (angles around −2π to +2π), so errors in reducing large arguments (up to around 10^308) may be irrelevant because those arguments are never used. (A sketch of measuring worst-case and mean error appears after this list.)
    • There are some common points where various routines should be tested. For example, for the trigonometric routines, test at various fractions of π. Aside from being mathematically interesting, these tend to be where implementations switch between approximations internally. Also test at large numbers that are representable but happen to be very near multiples of simple fractions of π. These are the worst cases for argument reduction and can yield huge relative errors if not done correctly. They require number theory to find. Any sort of scattershot testing, or even orderly approaches that fail to consider this reduction problem, will miss these troublesome arguments, so it would be easy to report as accurate a routine that actually had huge errors. (The probe sketch after this list shows how to generate candidate points of this kind.)
    • On the other hand, there are important points to test that cannot be known without internal knowledge of the implementation. For example, when designing a sine routine, I would use the Remez algorithm to find a minimax polynomial, aiming for it to be good from, say, −π/2 to +π/2 (quite large for this sort of thing, but just for example). Then I would look at the arithmetic and rounding errors that could occur during argument reduction. Sometimes they would produce a result a little outside that interval. So I would go back to the minimax polynomial generation and push for a slightly larger interval. And I would also look for improvements in the argument reduction. In the end, I would end up with a reduction guaranteed to produce results within a certain interval and a polynomial known to be good to a certain accuracy within that interval. To test my routine, you then need to know the endpoints of that interval, and you have to be able to find some arguments for which the argument reduction yields points near those endpoints, which means you have to have some idea of how my argument reduction is implemented: how many bits it uses, and so on. Like the troublesome arguments mentioned above, these points cannot be found with a scattershot approach. But unlike those above, they cannot be found from pure mathematics; you need information about the implementation. This makes it practically impossible to know you have compared the worst potential arguments for implementations.
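To illustrate the exhaustive-testing point, here is a minimal sketch of a faithfulness test for sinf over all binary32 inputs, using MPFR as a convenient stand-in for the GMP-based reference described above (MPFR's functions are correctly rounded). It omits the old-versus-new shortcut, so it invokes the multiple-precision check for every input and will run much slower than the three minutes quoted:

```c
/* Exhaustive faithfulness test for a binary32 sin implementation.
 * Sketch only: link with -lmpfr -lgmp -lm. */
#include <math.h>
#include <mpfr.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* r is a faithful rounding of sin(x) iff it equals the true value
 * rounded down or rounded up to binary32.  (Real code would hoist the
 * init/clear out of the loop.) */
static int is_faithful_sinf(float x, float r)
{
    mpfr_t t;
    mpfr_init2(t, 256);                 /* ample working precision */
    mpfr_set_flt(t, x, MPFR_RNDN);      /* exact: x is representable */
    mpfr_sin(t, t, MPFR_RNDN);
    float lo = mpfr_get_flt(t, MPFR_RNDD);
    float hi = mpfr_get_flt(t, MPFR_RNDU);
    mpfr_clear(t);
    return r == lo || r == hi;
}

int main(void)
{
    uint64_t failures = 0;
    for (uint64_t i = 0; i <= UINT32_MAX; i++) {
        uint32_t bits = (uint32_t)i;
        float x;
        memcpy(&x, &bits, sizeof x);
        if (!isfinite(x))
            continue;                   /* NaN/infinity handling: test separately */
        float r = sinf(x);              /* implementation under test */
        /* In the scheme above, first compare r against a trusted old
         * implementation and fall through to MPFR only on disagreement. */
        if (!is_faithful_sinf(x, r)) {
            failures++;
            printf("not faithful: x=%a sinf(x)=%a\n", x, r);
        }
    }
    printf("%llu failures\n", (unsigned long long)failures);
    return 0;
}
```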
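For the metrics point, here is a sketch of measuring worst-case and mean error in ulps against an MPFR reference. The only ingredients are the sampling scheme (arbitrary here) and the ulp formula for binary32 normals, ulp(y) = 2^(exp(y) − 24) under MPFR's significand-in-[1/2, 1) convention; zero and subnormal results are glossed over:

```c
#include <math.h>
#include <mpfr.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Error of float r against a high-precision reference y, in binary32 ulps.
 * Assumes y is nonzero and in the normal range; subnormals need
 * special-casing that this sketch omits. */
static double ulp_error(float r, mpfr_t y)
{
    mpfr_t d;
    mpfr_init2(d, 256);
    mpfr_sub_d(d, y, (double)r, MPFR_RNDN);  /* r converts exactly to double */
    mpfr_abs(d, d, MPFR_RNDN);
    /* divide by ulp(y) = 2^(exp(y) - 24) */
    mpfr_mul_2si(d, d, 24 - (long)mpfr_get_exp(y), MPFR_RNDN);
    double e = mpfr_get_d(d, MPFR_RNDN);
    mpfr_clear(d);
    return e;
}

int main(void)
{
    mpfr_t y;
    mpfr_init2(y, 256);
    double max_err = 0.0, sum_err = 0.0;
    unsigned long counted = 0;
    srand(1);
    for (int i = 0; i < 1000000; i++) {
        uint32_t bits = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        float x;
        memcpy(&x, &bits, sizeof x);
        if (!isfinite(x)) continue;
        mpfr_set_flt(y, x, MPFR_RNDN);       /* exact */
        mpfr_sin(y, y, MPFR_RNDN);           /* reference value */
        if (mpfr_zero_p(y)) continue;        /* ulp(0) not defined here */
        double e = ulp_error(sinf(x), y);
        if (e > max_err) max_err = e;
        sum_err += e;
        counted++;
    }
    printf("worst %.3f ulp, mean %.6f ulp over %lu samples\n",
           max_err, sum_err / counted, counted);
    mpfr_clear(y);
    return 0;
}
```

Reporting both numbers lets you apply whichever metric matters for your application: the maximum for provable bounds, the mean for bulk numerical work.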
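For the troublesome-arguments point, here is a sketch that probes floats near multiples of π/2, obtained simply by rounding k·π/2 to float. As noted above, these are only candidates; the genuinely worst representable arguments require a number-theoretic (continued-fraction) search. It also includes the classic huge-argument check sin(1e22), since 1e22 is exactly representable in binary64 and is a standard stress test for argument reduction:

```c
/* Probe arguments that stress trigonometric argument reduction.
 * Sketch only: link with -lmpfr -lgmp -lm. */
#include <math.h>
#include <mpfr.h>
#include <stdio.h>

int main(void)
{
    mpfr_t half_pi, t, y;
    mpfr_inits2(256, half_pi, t, y, (mpfr_ptr)0);
    mpfr_const_pi(half_pi, MPFR_RNDN);
    mpfr_div_ui(half_pi, half_pi, 2, MPFR_RNDN);   /* half_pi = pi/2 */

    for (unsigned long k = 1; k <= 1000000; k *= 10) {
        mpfr_mul_ui(t, half_pi, k, MPFR_RNDN);     /* t = k*pi/2 */
        float x = mpfr_get_flt(t, MPFR_RNDN);      /* nearest float to k*pi/2 */
        mpfr_set_flt(y, x, MPFR_RNDN);
        mpfr_sin(y, y, MPFR_RNDN);                 /* reference sin(x) */
        printf("k=%7lu x=%a sinf=%a ref=%a\n",
               k, x, sinf(x), mpfr_get_flt(y, MPFR_RNDN));
    }

    /* Classic huge-argument test: 1e22 is exactly representable. */
    double x = 1e22;
    mpfr_set_d(y, x, MPFR_RNDN);
    mpfr_sin(y, y, MPFR_RNDN);
    printf("sin(1e22): libm=%.17g ref=%.17g\n", sin(x), mpfr_get_d(y, MPFR_RNDN));

    mpfr_clears(half_pi, t, y, (mpfr_ptr)0);
    return 0;
}
```

Near a multiple of π/2 the reduced argument is tiny, so any shortfall in the precision of the reduction shows up as a large relative error in the result, which is exactly what this probe is looking for.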