Numpy mean giving slightly different results based on row order


In a test case we are using np.testing.assert_allclose to determine whether two data sources agree with each other on the mean. But despite containing the same data in a different order, the computed means are slightly different. Here is the shortest working example:

import numpy as np

x = np.array(
    [[0.5224021, 0.8526993], [0.6045113, 0.7965965], [0.5053657, 0.86290526], [0.70609194, 0.7081201]],
    dtype=np.float32,
)
y = np.array(
    [[0.5224021, 0.8526993], [0.70609194, 0.7081201], [0.6045113, 0.7965965], [0.5053657, 0.86290526]],
    dtype=np.float32,
)
print("X mean", x.mean(0))
print("Y mean", y.mean(0))
z = x[[0, 3, 1, 2]]
print("Z", z)
print("Z mean", z.mean(0))

np.testing.assert_allclose(z.mean(0), y.mean(0))
np.testing.assert_allclose(x.mean(0), y.mean(0))

With Python 3.10.6 and NumPy 1.24.2, this gives the following output:

X mean [0.58459276 0.8050803 ]
Y mean [0.5845928 0.8050803]
Z [[0.5224021  0.8526993 ]
 [0.70609194 0.7081201 ]
 [0.6045113  0.7965965 ]
 [0.5053657  0.86290526]]
Z mean [0.5845928 0.8050803]
Traceback (most recent call last):
  File "/home/nuric/semafind-db/scribble.py", line 19, in <module>
    np.testing.assert_allclose(x.mean(0), y.mean(0))
  File "/home/nuric/semafind-db/.venv/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 1592, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/nuric/semafind-db/.venv/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 862, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 1 / 2 (50%)
Max absolute difference: 5.9604645e-08
Max relative difference: 1.0195925e-07
 x: array([0.584593, 0.80508 ], dtype=float32)
 y: array([0.584593, 0.80508 ], dtype=float32)

One workaround is to loosen the tolerance of the assertion, but any ideas why this might be happening?


Solution

  • You should use np.float64 to get more precision. np.float32 carries only about 7 significant decimal digits, so a difference around the 8th digit, like the one here, is expected. This code will work:

    import numpy as np
    
    x = np.array(
        [[0.5224021, 0.8526993], [0.6045113, 0.7965965], [0.5053657, 0.86290526], [0.70609194, 0.7081201]],
        dtype=np.float64,
    )
    y = np.array(
        [[0.5224021, 0.8526993], [0.70609194, 0.7081201], [0.6045113, 0.7965965], [0.5053657, 0.86290526]],
        dtype=np.float64,
    )
    print("X mean", x.mean(0))
    print("Y mean", y.mean(0))
    z = x[[0, 3, 1, 2]]
    print("Z", z)
    print("Z mean", z.mean(0))
    
    np.testing.assert_allclose(z.mean(0), y.mean(0))
    np.testing.assert_allclose(x.mean(0), y.mean(0))
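
If converting the data itself to np.float64 is not an option, a possible variation is to keep the arrays as np.float32 and only ask mean to accumulate in float64 through its dtype argument. This is a minimal sketch using the rows from the question; the names rows, mean_x and mean_y are just for illustration.

```python
import numpy as np

rows = [[0.5224021, 0.8526993], [0.6045113, 0.7965965],
        [0.5053657, 0.86290526], [0.70609194, 0.7081201]]
x = np.array(rows, dtype=np.float32)
y = x[[0, 3, 1, 2]]  # same rows in the order used for y in the question

# The data stays float32; only the accumulator used by mean() is float64.
mean_x = x.mean(0, dtype=np.float64)
mean_y = y.mean(0, dtype=np.float64)

# Any order-dependent rounding is now at float64 precision,
# far below the default rtol=1e-07.
np.testing.assert_allclose(mean_x, mean_y)
print(mean_x, mean_y)
```

The returned means are float64 here, so cast them back with astype(np.float32) if the rest of the pipeline expects float32.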
    

    Another thing you can do is increase the tolerance:

    import numpy as np
    
    x = np.array(
        [[0.5224021, 0.8526993], [0.6045113, 0.7965965], [0.5053657, 0.86290526], [0.70609194, 0.7081201]],
        dtype=np.float32,
    )
    y = np.array(
        [[0.5224021, 0.8526993], [0.70609194, 0.7081201], [0.6045113, 0.7965965], [0.5053657, 0.86290526]],
        dtype=np.float32,
    )
    print("X mean", x.mean(0))
    print("Y mean", y.mean(0))
    z = x[[0, 3, 1, 2]]
    print("Z", z)
    print("Z mean", z.mean(0))
    
    np.testing.assert_allclose(z.mean(0), y.mean(0), rtol=1e-6)
    np.testing.assert_allclose(x.mean(0), y.mean(0), rtol=1e-6)
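
To pick a tolerance with some justification, it may help to compare the reported difference with the float32 spacing. The sketch below copies the two numbers from the failing assertion output in the question and checks them against np.spacing and np.finfo. The maxulp=4 bound in the last line is an arbitrary illustrative choice, not a recommended value.

```python
import numpy as np

# Values copied from the assertion output in the question.
max_abs_diff = np.float32(5.9604645e-08)
x_mean0 = np.float32(0.58459276)

# np.spacing gives the gap between a value and the next representable one (1 ULP).
ulp = np.spacing(x_mean0)
eps = np.finfo(np.float32).eps

print("ULP at x_mean0:", ulp)
print("float32 eps   :", eps)
print("diff in ULPs  :", max_abs_diff / ulp)

# An ULP-based comparison is an alternative to choosing rtol by hand.
x = np.array(
    [[0.5224021, 0.8526993], [0.6045113, 0.7965965], [0.5053657, 0.86290526], [0.70609194, 0.7081201]],
    dtype=np.float32,
)
y = x[[0, 3, 1, 2]]
np.testing.assert_array_max_ulp(x.mean(0), y.mean(0), maxulp=4)
```

The reported difference is exactly one float32 ULP at that magnitude, and the default rtol=1e-07 is smaller than the float32 machine epsilon, which is why the default assertion is too strict for float32 results.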
    

    Finally, this error happens because the sum is accumulated in a different order for x than for y and z. Each intermediate np.float32 addition is rounded, and floating-point addition is not associative, so a different order can produce a result that differs in the last bit. You can see that by printing more decimal places:

    import numpy as np
    
    np.set_printoptions(formatter={'float': lambda x: "{0:0.10f}".format(x)})
    
    x = np.array(
        [[0.5224021, 0.8526993], [0.6045113, 0.7965965], [0.5053657, 0.86290526], [0.70609194, 0.7081201]],
        dtype=np.float32,
    )
    y = np.array(
        [[0.5224021, 0.8526993], [0.70609194, 0.7081201], [0.6045113, 0.7965965], [0.5053657, 0.86290526]],
        dtype=np.float32,
    )
    print("X mean", x.mean(0))
    print("Y mean", y.mean(0))
    z = x[[0, 3, 1, 2]]
    print("Z", z)
    print("Z mean", z.mean(0))
    
    np.testing.assert_allclose(z.mean(0), y.mean(0), rtol=1e-6)
    np.testing.assert_allclose(x.mean(0), y.mean(0), rtol=1e-6)
    

    Which will print:

    X mean [0.5845927596 0.8050802946]
    Y mean [0.5845928192 0.8050802946]
    Z [[0.5224021077 0.8526992798]
     [0.7060919404 0.7081201077]
     [0.6045113206 0.7965965271]
     [0.5053657293 0.8629052639]]
    Z mean [0.5845928192 0.8050802946]
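
The order dependence can also be reproduced by hand. The sketch below adds the first column left to right in float32, once in the row order of x and once in the row order of y and z. It assumes that a plain left-to-right loop is a fair stand-in for how the reduction over axis 0 proceeds, which is an assumption about NumPy internals rather than something the output above proves. The helper sequential_sum is hypothetical. For contrast, math.fsum returns a correctly rounded sum, so its result does not depend on the order.

```python
import math
import numpy as np

# First column of x from the question.
col = np.array([0.5224021, 0.6045113, 0.5053657, 0.70609194], dtype=np.float32)

def sequential_sum(values):
    """Add float32 values left to right, rounding to float32 after each step."""
    total = np.float32(0.0)
    for v in values:
        total = np.float32(total + v)
    return total

sum_x_order = sequential_sum(col)                # row order of x
sum_y_order = sequential_sum(col[[0, 3, 1, 2]])  # row order of y and z

print("x order mean:", sum_x_order / np.float32(4))
print("y order mean:", sum_y_order / np.float32(4))

# math.fsum is correctly rounded, so both orders give the same float64 result.
exact_x = math.fsum(float(v) for v in col)
exact_y = math.fsum(float(v) for v in col[[0, 3, 1, 2]])
print("fsum mean   :", exact_x / 4, exact_y / 4)
```

In a test, comparing against a float64 or fsum-based reference with a tolerance on the order of the float32 epsilon tends to be more robust than comparing two float32 means directly.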