
Precision: why do Matlab and Python NumPy give such different outputs?


I know about basic data types and that float types (float, double) cannot represent some numbers exactly.

While porting some code from Matlab to Python (NumPy), however, I found some significant differences in the calculations, and I think it comes down to precision.

Take the following code, which z-normalizes a 500-dimensional vector in which only the first two elements have a non-zero value.

Matlab:

Z = repmat(0,500,1); Z(1)=3;Z(2)=1;
Za = (Z-repmat(mean(Z),500,1)) ./ repmat(std(Z),500,1);
Za(1)
>>> 21.1694

Python:

from numpy import zeros, mean, std
Z = zeros((500,))
Z[0] = 3
Z[1] = 1
Za = (Z - mean(Z)) / std(Z)
print(Za[0])
>>> 21.1905669677

Aside from Python displaying a few more digits, there is a huge difference (imho) of more than 0.02.

Both Python and Matlab use a 64-bit floating-point type (afaik): Python uses 'numpy.float64' and Matlab uses 'double'.
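
A quick check confirms this on the NumPy side (a minimal snippet; Matlab's class(Z) likewise reports 'double'):

import numpy as np

Z = np.zeros((500,))
print(Z.dtype)                 # float64
print(np.finfo(Z.dtype).bits)  # 64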

Why is the difference so huge? Which one is more correct?


Solution

  • Maybe the difference comes from the mean and std calls. Compare those first.

    There are several definitions of std; some use the square root of

    1 / n * sum((xi - mean(x)) ** 2)
    

    others use

    1 / (n - 1) * sum((xi - mean(x)) ** 2)
    

    instead. NumPy's std defaults to the first convention (ddof=0), while Matlab's std uses the second; a short check reproducing both of your numbers follows at the end of this answer.

    From a mathematical point of view: these formulas are estimators of the variance of a normally distributed random variable. The distribution has two parameters, sigma and mu. If you know mu exactly, the optimal estimator for sigma ** 2 is

    1 / n * sum((xi - mu) ** 2)
    

    If you have to estimate mu from the data using mu = mean(x), the optimal estimator for sigma ** 2 is

    1 / (n - 1) * sum((xi - mean(x)) ** 2)
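
    As a concrete check (a minimal sketch reusing the 500-element vector from the question), numpy's ddof parameter switches between the two conventions, and ddof=1 reproduces the Matlab result:

    import numpy as np

    Z = np.zeros((500,))
    Z[0] = 3
    Z[1] = 1

    # numpy's default is the 1 / n convention (ddof=0) ...
    print((Z[0] - Z.mean()) / Z.std())        # 21.1905669677
    # ... while Matlab's std normalizes by 1 / (n - 1), i.e. ddof=1:
    print((Z[0] - Z.mean()) / Z.std(ddof=1))  # 21.1694

    So neither value is wrong: the two environments simply default to different estimators, and passing ddof=1 to numpy's std (or var) makes the Python port match Matlab.

    A quick simulation (illustrative only; the seed and sample sizes are arbitrary) also shows why the 1 / (n - 1) version is preferred when mu must be estimated from the data: it is unbiased.

    from numpy.random import default_rng

    rng = default_rng(0)
    x = rng.standard_normal((100000, 10))  # 100000 samples of size n=10, true sigma ** 2 = 1
    print(x.var(axis=1, ddof=0).mean())    # ~0.9, biased low by a factor of (n - 1) / n
    print(x.var(axis=1, ddof=1).mean())    # ~1.0, unbiased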