There are plenty of good implementations to pick from:
- Intel MKL is likely the best on Intel machines. It is proprietary though (free to download these days, but closed source), so that may be a problem.
- According to its own benchmarks, OpenBLAS compares quite well with Intel MKL, and it is free and open source.
- Eigen is also an option and has a largish (albeit old) benchmark showing good performance on small matrices (though it's not technically a drop-in BLAS library)
- ATLAS, OSKI and pOSKI are examples of auto-tuned kernels which claim to work well on many architectures (OSKI and pOSKI target sparse matrices)
Generally, it is quite hard to pick one of these without benchmarking because:
- some implementations work better on particular types of matrices. For example, Eigen works better on small matrices (dimensions in the 100s)
- some are optimised for specific architectures (e.g. Intel's)
- in some cases the multithreading of the BLAS library (e.g. OpenBLAS) may conflict with a multithreaded application, so that the two oversubscribe the available cores
- developers' own benchmarks tend to emphasise the cases which work better on their implementation.
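I would suggest picking one or two of these libraries that suit your use case and benchmarking them for your particular application on your particular (or a similar) machine. This is quite easy to do even after your code is written: most of them expose the standard BLAS interface, so you can usually swap one for another at link time without changing the code.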
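
A minimal sketch of such a benchmark, assuming Python and, for the actual measurement, a NumPy build linked against the BLAS you want to test. The function names (`time_gemm`, `gflops`, `benchmark_numpy`) and the matrix sizes are my own choices for illustration, not part of any library.

```python
import time


def time_gemm(gemm, a, b, repeats=5):
    """Return the best wall-clock time in seconds of gemm(a, b) over `repeats` runs."""
    gemm(a, b)  # warm-up run, not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        gemm(a, b)
        best = min(best, time.perf_counter() - start)
    return best


def gflops(n, seconds):
    """Approximate GFLOP/s for an n x n times n x n product (about 2*n**3 flops)."""
    return 2.0 * n**3 / seconds / 1e9


def benchmark_numpy(sizes=(64, 256, 1024)):
    """Time numpy.matmul, which dispatches to whatever BLAS NumPy is linked against."""
    import numpy as np  # imported here so the timing helpers work without NumPy

    rng = np.random.default_rng(0)
    for n in sizes:
        a = rng.random((n, n))
        b = rng.random((n, n))
        t = time_gemm(np.matmul, a, b)
        print(f"n={n:5d}  {t * 1e3:9.3f} ms  {gflops(n, t):8.2f} GFLOP/s")
```

Run `benchmark_numpy()` once per BLAS build, using matrix sizes that match your application, and compare the numbers. `numpy.show_config()` reports which BLAS a given NumPy build was linked against.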