Tags: macos, clang, osx-yosemite, floating-accuracy, accelerate-framework

Numerical differences between older Mac Mini and newer Macbook


I have a project that I compile on both my Mac Mini (Core 2 Duo) and a 2014 MacBook with a quad-core i7. Both are running the latest version of Yosemite. The application is single-threaded, and I am compiling the tool and libraries using the exact same versions of CMake and the Clang (Xcode) compiler. I am getting test failures due to slight numeric differences.

I am wondering whether the inconsistency comes from the Clang compiler automatically applying processor-specific optimizations (which I did not select in CMake). Could the difference be between the processors? Do the frameworks use processor-specific optimizations? I am using the BLAS/LAPACK routines from the Accelerate framework; they are called from the SuperLU sparse matrix factorization package.
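
For reference, here is a minimal sketch of the kind of Accelerate BLAS call involved. It is illustrative only: the matrix values are arbitrary, and in my project the routines are invoked internally by SuperLU rather than directly.

```c
/* Sketch of a direct Accelerate BLAS call; compile with:
   clang demo.c -framework Accelerate */
#include <Accelerate/Accelerate.h>
#include <stdio.h>

int main(void) {
    double A[4] = {1.0, 2.0, 3.0, 4.0};
    double B[4] = {5.0, 6.0, 7.0, 8.0};
    double C[4] = {0.0, 0.0, 0.0, 0.0};

    /* C = 1.0 * A * B + 0.0 * C for 2x2 row-major matrices. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    printf("%.17g %.17g\n%.17g %.17g\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```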


Solution

  • In general you should not expect results from BLAS or LAPACK to be bitwise reproducible across machines. There are a number of factors that implementors tune to get the best performance, all of which result in small differences in rounding:

    • your two machines have different numbers of cores, which will result in work being divided differently for threading purposes (even if your application is single-threaded, BLAS may use multiple threads internally).
    • your two machines differ in hyper-threading support (the Core 2 Duo lacks it; the i7 has it), which may also cause BLAS to use different numbers of threads.
    • the cache and TLB hierarchy is different between your two machines, which means that different block sizes are optimal for data reuse.
    • the SIMD vector width on the newer machine is twice that of the older machine (256-bit AVX2 on the i7 versus 128-bit SSE on the Core 2 Duo), which again will affect how arithmetic is grouped.
    • finally, the newer machine supports FMA (and using FMA is necessary to get the best performance on it); this also contributes to small differences in rounding, as the sketch after this list illustrates.
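
    As a concrete illustration of the FMA point: a fused multiply-add rounds once, where a separate multiply and add round twice. The sketch below makes that difference visible; the operand values are chosen purely to expose the rounding.

    ```c
    /* One rounding (fma) vs. two roundings (separate multiply, then add).
       Compile with: clang -O2 -ffp-contract=off fma_demo.c
       (-ffp-contract=off keeps clang from fusing a * b - 1.0 on its own) */
    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double a = 1.0 + DBL_EPSILON;
        double b = 1.0 - DBL_EPSILON;

        /* a * b = 1 - DBL_EPSILON^2, which rounds to exactly 1.0 as a
           double, so the subsequent subtraction yields 0.0. */
        double separate = a * b - 1.0;

        /* fma keeps the product exact until one final rounding, so the
           tiny true value -DBL_EPSILON^2 survives. */
        double fused = fma(a, b, -1.0);

        printf("separate: %.17g\n", separate); /* 0 */
        printf("fused:    %.17g\n", fused);    /* about -4.93e-32 */
        return 0;
    }
    ```

    Any code path that the compiler or library contracts to FMA on one machine but not the other can diverge in the last bits in exactly this way.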

    Any one of these factors would be enough to result in small differences; taken together, it should be expected that the results will not be bitwise identical. And that's OK, so long as both results satisfy the error bounds of the computation.
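
    In practice, that means the test suite should compare results against the computation's error bound rather than byte-for-byte. A minimal sketch of such a check follows; nearly_equal and the tolerance values are illustrative placeholders, not part of any particular test framework.

    ```c
    #include <assert.h>
    #include <math.h>
    #include <stdbool.h>

    /* Hypothetical helper: true if x and y agree to within a relative
       tolerance (dominant for large values) or an absolute one (near zero). */
    static bool nearly_equal(double x, double y, double rel_tol, double abs_tol) {
        double diff  = fabs(x - y);
        double scale = fmax(fabs(x), fabs(y));
        return diff <= fmax(abs_tol, rel_tol * scale);
    }

    int main(void) {
        /* Two values that differ only in the last bits, as results from
           the two machines might. */
        double mac_mini = 0.1 + 0.2; /* 0.30000000000000004... */
        double macbook  = 0.3;       /* 0.29999999999999998... */

        assert(mac_mini != macbook);                            /* bitwise check fails    */
        assert(nearly_equal(mac_mini, macbook, 1e-12, 1e-15));  /* tolerance check passes */
        return 0;
    }
    ```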

    Making the results identical would require severely limiting the performance on the newer machine, which would result in your shiny expensive hardware going to waste.