r correlation bignum pearson

Wrong correlation result for big numbers


The cor() function fails to compute the correlation when a vector contains extremely large numbers, and simply returns zero:

foo <- c(1e154, 1, 0)
bar <- c(0, 1, 2)
cor(foo, bar)
# -0.8660254
foo <- c(1e155, 1, 0)
cor(foo, bar)
# 0

Although 1e155 is very big, it is much smaller than the largest number R can represent. I find it surprising that R returns a wrong value rather than a more suitable result such as NA or Inf.

Is there a reason for this? How can we make sure we do not run into this situation in our programs?


Solution

  • Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations. (from http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)

    foo <- c(1e154, 1, 0)
    sd(foo)
    ## [1] 5.773503e+153
    foo <- c(1e155, 1, 0)
    sd(foo)
    ## [1] Inf
    

    And, even more fundamentally, to calculate sd() you need to square x:

    1e154^2
    ## [1] 1e+308

    1e155^2
    ## [1] Inf
    

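    So your number is indeed right at the boundary of what can be calculated with a 64-bit double: the largest finite value is about 1.8e308, so 1e155^2 (which would be 1e310) overflows to Inf. A finite covariance divided by an infinite standard deviation gives 0, which is consistent with the result you see.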
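
    The boundary can be checked directly. A small sketch, assuming IEEE 754 doubles (which is what R uses for numeric values on common platforms): .Machine$double.xmax is the largest finite double, and any value larger than its square root overflows when squared. The variable names xmax and limit are only illustrative.

    ```r
    # Largest finite double on this machine (about 1.797693e+308 for IEEE 754)
    xmax <- .Machine$double.xmax

    # Squaring any value above this threshold overflows to Inf
    limit <- sqrt(xmax)    # about 1.340781e+154

    1e154 < limit          # TRUE:  1e154^2 is still finite
    1e155 < limit          # FALSE: 1e155^2 overflows
    is.finite(1e154^2)     # TRUE
    is.finite(1e155^2)     # FALSE
    ```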

    Using R-2.15.2 on Windows I get:

    cor(c(1e555, 1, 0), 1:3)
    [1] NaN
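
    As for avoiding the problem in your own programs: Pearson's correlation does not change when a variable is multiplied by a positive constant, so one possible workaround is to rescale the vectors before calling cor(). This is only a minimal sketch; safe_cor is a hypothetical helper name, not part of base R, and it assumes finite, non-missing input.

    ```r
    # Hypothetical helper (not part of base R): divide each vector by its
    # largest absolute value so that squaring cannot overflow, then call cor().
    safe_cor <- function(x, y) {
      stopifnot(all(is.finite(x)), all(is.finite(y)))
      sx <- max(abs(x))
      sy <- max(abs(y))
      # Guard against all-zero input, where dividing would produce NaN
      if (sx == 0 || sy == 0) return(NA_real_)
      cor(x / sx, y / sy)
    }

    foo <- c(1e155, 1, 0)
    bar <- c(0, 1, 2)
    safe_cor(foo, bar)
    ## [1] -0.8660254
    ```

    After rescaling, the largest magnitude in each vector is 1, so the squared deviations stay well inside the double range. Very small entries may underflow towards zero, which should not matter much for the correlation in a case like this one.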