I am doing data analysis in Python (NumPy) and R. My data is a 795067 x 3 array, and computing the mean, median, standard deviation, and IQR on it yields different results depending on whether I use NumPy or R. I cross-checked the values, and it looks like R gives the "correct" ones.
Median:
Numpy: 14.948499999999999
R: 14.9632
Mean:
Numpy: 13.097945407088607
R: 13.10936
Standard Deviation:
Numpy: 7.3927612774052083
R: 7.390328
IQR:
Numpy: 12.358700000000002
R: 12.3468
Max and min of the data are the same on both platforms. I ran a quick test to better understand what is going on here.
In Numpy, the numbers are float64 datatype and they are double in R. What is going on here? Why are Numpy and R giving different results? I know R uses IEEE754 double-precision but I don't know what precision Numpy uses. How can I change Numpy to give me the "correct" answer?
The print statement/function in Python prints floats rounded to a limited number of digits, which can make them look like single-precision values. Calculations are still done in the precision specified. Python/numpy uses double-precision floats by default (at least on my 64-bit machine):
import numpy
single = numpy.float32(1.222) * numpy.float32(1.222)
double = numpy.float64(1.222) * numpy.float64(1.222)
pyfloat = 1.222 * 1.222
print single, double, pyfloat
# 1.49328 1.493284 1.493284
print "%.16f, %.16f, %.16f"%(single, double, pyfloat)
# 1.4932839870452881, 1.4932839999999998, 1.4932839999999998
In an interactive Python/IPython shell, the result of a statement is echoed with its full double-precision digits:
>>> 1.222 * 1.222
1.4932839999999998
In [1]: 1.222 * 1.222
Out[1]: 1.4932839999999998
It looks like R behaves the same as Python when using print and sprintf:
print(1.222 * 1.222)
# 1.493284
sprintf("%.16f", 1.222 * 1.222)
# "1.4932839999999998"
In contrast to interactive Python shells, the interactive R shell also prints a rounded value (7 significant digits by default) rather than the full double-precision digits when echoing the results of statements:
> 1.222 * 1.222
[1] 1.493284
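To make the difference between the displayed digits and the stored value concrete, here is a minimal sketch. It uses Python 3 syntax (the snippets above are Python 2), the variable names are made up for illustration, and the `.7g` format is only an approximation of R's default 7-significant-digit display:

```python
# The stored value is a double; only the text representation differs.
x = 1.222 * 1.222

short = format(x, ".7g")    # 7 significant digits, roughly what R shows by default
full = format(x, ".16f")    # 16 decimal places, as in the sprintf call above

print(short)  # 1.493284
print(full)   # 1.4932839999999998

# Parsing the short text back does not recover the stored double,
# so the short form is a rounded view, not the value used in calculations.
print(float(short) == x)  # False
```

So identical doubles can look different (or different doubles can look identical) depending purely on how they are printed.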
The differences in your results could come from using single-precision values in numpy. Calculations with a lot of additions/subtractions will eventually make the problem surface:
In [1]: import numpy
In [2]: a = numpy.float32(1.222)
In [3]: a*6
Out[3]: 7.3320000171661377
In [4]: a+a+a+a+a+a
Out[4]: 7.3320003
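The same effect can be sketched on a larger scale. The example below is only an illustration in Python 3 syntax with made-up data (one million copies of 1.222), not your actual data set. It uses `numpy.cumsum` because a running sum has to be accumulated element by element, and compares a float32 accumulator with a float64 one:

```python
import numpy as np

# Hypothetical data: one million single-precision copies of 1.222.
values = np.full(1_000_000, 1.222, dtype=np.float32)

# Running total accumulated in float32 (the array's own dtype).
total32 = np.cumsum(values)[-1]

# Same float32 input, but accumulated in float64.
total64 = np.cumsum(values, dtype=np.float64)[-1]

exact = 1.222 * 1_000_000

print("float32 accumulator error:", abs(float(total32) - exact))
print("float64 accumulator error:", abs(float(total64) - exact))
```

With the float32 accumulator, each addition is rounded to the spacing of float32 values near the running total, so the error grows as the total gets larger.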
As suggested in the comments to your actual question, make sure to use double-precision floats in your numpy calculations.
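A minimal sketch of what that could look like, assuming the data ended up as float32 somewhere along the way (for example while loading it). The array here is random stand-in data, and names such as `data32` and `stats` are hypothetical:

```python
import numpy as np

# Stand-in for the real data; assumed to have been loaded as float32.
rng = np.random.default_rng(0)
data32 = rng.uniform(0.0, 30.0, size=(1000, 3)).astype(np.float32)

# First check what precision the array actually has.
print(data32.dtype)  # float32

# Convert once to double precision, then compute the statistics.
data64 = data32.astype(np.float64)
col = data64[:, 0]

q75, q25 = np.percentile(col, [75, 25])
stats = {
    "mean": col.mean(),
    "median": np.median(col),
    # ddof=1 divides by n - 1, as R's sd() does; numpy's default is ddof=0.
    "std": col.std(ddof=1),
    "iqr": q75 - q25,
}
print(stats)
```

If the array already reports float64, precision is probably not the cause, and it is worth checking that both programs read exactly the same values.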