Tags: macos, lapack, accelerate-framework

Is it reasonable to expect identical results for LAPACK routines on two different processor architectures?


I'm trying to invert a 400x400 matrix with LAPACK using the Apple Accelerate library. I created a matrix inversion method using ?getrf_ and ?getri_:

void invert() {
    int num_rows = get_num_rows();
    int num_columns = get_num_columns();
    int info = 0;

    // ?getrf_ needs at least min(m, n) pivot entries.
    aligned_buffer<int> pivot(std::min(num_rows, num_columns));

    // LU-factorize the matrix in place.
    if constexpr (std::is_same<T, float>::value) {
        sgetrf_(&num_rows, &num_columns, const_cast<T *> (get()),
                &num_rows, pivot.data(), &info);
    } else {
        dgetrf_(&num_rows, &num_columns, const_cast<T *> (get()),
                &num_rows, pivot.data(), &info);
    }
    assert((info == 0) && "Error factorizing matrix.");

    // Workspace query: lwork == -1 makes ?getri_ report the optimal
    // workspace size in its first work entry without inverting anything.
    T work_query = 0;
    int work_size = -1;
    if constexpr (std::is_same<T, float>::value) {
        sgetri_(&num_rows, const_cast<T *> (get()), &num_rows,
                pivot.data(), &work_query, &work_size, &info);
    } else {
        dgetri_(&num_rows, const_cast<T *> (get()), &num_rows,
                pivot.data(), &work_query, &work_size, &info);
    }
    work_size = static_cast<int>(work_query);
    aligned_buffer<T> work(work_size);

    // Compute the inverse from the LU factors, again in place.
    if constexpr (std::is_same<T, float>::value) {
        sgetri_(&num_rows, const_cast<T *> (get()), &num_rows,
                pivot.data(), work.data(), &work_size, &info);
    } else {
        dgetri_(&num_rows, const_cast<T *> (get()), &num_rows,
                pivot.data(), work.data(), &work_size, &info);
    }
    assert((info == 0) && "Error inverting matrix.");
}

However, I seem to be getting different results when running the code on two different machines. One machine has a Quad-Core Intel Core i7 and the other has a Quad-Core Intel Core i5. When I compare the two inverted matrices, the results are identical except in rows 49-50, 97-98, 148-149, 197-198, and so on. The differences are tiny, but they are there.


Is it reasonable to expect these to be identical to this level of precision across two different processor architectures, using the same LAPACK library, or a different LAPACK library for that matter?


Solution

  • Generally, no, it's not reasonable to expect bit-exact results on different hardware. The fact that you are performing a numerically unstable procedure may also amplify whatever differences there are (generally prefer a solve with the factors rather than an explicit invert if possible; see the first sketch below).

    Due to the nature of floating-point arithmetic, any difference in the order or precision of operations can produce different results. BLAS (which underlies most of LAPACK) is normally written primarily for performance. This means that the order of operations may differ between microarchitectures, never mind between architectures (e.g. to accommodate different numbers of FMA units in the core). Data alignment can also make a difference, due to loop peeling in some operations. Availability of FMA, vector length, and/or different internal precisions will also make a difference; the second sketch below demonstrates two of these effects.

    Even on the same hardware, with the same library, with the same data in the same alignment, you may still not get reproducibility if threading is involved, unless the library authors have gone out of their way to guarantee this.
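    As a concrete illustration of the first point, here is a minimal sketch of the factor-and-solve route: instead of forming an explicit inverse, it solves A * x = b with ?getrs_ from the LU factors. The prototypes are declared inline so the snippet is self-contained (on macOS they come from Accelerate), and solve_inplace is just an illustrative name, not part of any library.

#include <cassert>
#include <vector>

// LAPACK prototypes (on macOS, Accelerate provides these).
extern "C" {
void dgetrf_(const int *m, const int *n, double *a, const int *lda,
             int *ipiv, int *info);
void dgetrs_(const char *trans, const int *n, const int *nrhs,
             const double *a, const int *lda, const int *ipiv,
             double *b, const int *ldb, int *info);
}

// Solve A * x = b using the LU factors directly rather than inverting A.
// `a` is an n-by-n matrix in column-major order, overwritten by its LU
// factorization; `b` holds the right-hand side and is overwritten with x.
void solve_inplace(std::vector<double> &a, std::vector<double> &b, int n) {
    std::vector<int> pivot(n);
    int info = 0;

    dgetrf_(&n, &n, a.data(), &n, pivot.data(), &info);
    assert(info == 0 && "Error factorizing matrix.");

    const char trans = 'N'; // solve A * x = b rather than A^T * x = b
    const int nrhs = 1;     // a single right-hand side
    dgetrs_(&trans, &n, &nrhs, a.data(), &n, pivot.data(),
            b.data(), &n, &info);
    assert(info == 0 && "Error solving from the LU factors.");
}

    For a single right-hand side this also does roughly a third of the floating-point work of a full inversion, besides being better behaved numerically.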
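    And since the answer turns on the order and precision of individual floating-point operations, here is a small self-contained demonstration (with values chosen purely for illustration) of how reassociation and FMA availability change the bits:

#include <cmath>
#include <cstdio>

int main() {
    // Reassociation: the same three floats summed in two different orders.
    float a = 1e8f, b = -1e8f, c = 1e-3f;
    float left = (a + b) + c;  // (a + b) is exactly 0, so this is 1e-3f
    float right = a + (b + c); // c is absorbed when added to b, so this is 0
    std::printf("(a+b)+c = %.9g,  a+(b+c) = %.9g\n", left, right);

    // FMA: fma(x, y, z) rounds once; x * y followed by + z rounds twice.
    // x and y are exactly representable, so the difference below is purely
    // a rounding effect, not input noise.
    double x = 1.0 + std::ldexp(1.0, -29); // 1 + 2^-29
    double y = 1.0 - std::ldexp(1.0, -29); // 1 - 2^-29
    double p = x * y;                            // rounds 1 - 2^-58 to 1.0
    double two_roundings = p - 1.0;              // 0.0
    double one_rounding = std::fma(x, y, -1.0);  // exactly -2^-58
    std::printf("x*y - 1 = %g,  fma(x, y, -1) = %g\n",
                two_roundings, one_rounding);
    // With aggressive floating-point contraction a compiler may itself fuse
    // a multiply-add into an FMA, which is exactly the kind of flag- and
    // hardware-dependent variation described above.
    return 0;
}

    Neither result is wrong; each is correctly rounded for the sequence of operations actually performed. That is the point: different sequences, different bits.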