Tags: r, mean, standard-deviation, bit64

R, bit64, problems calculating row mean and standard deviation in data.table


I am trying to work with larger numbers, over 2^32. While I am also using data.table and fread, I do not believe the problem is related to them: I can turn the symptoms on and off without changing data.table or having used fread. My symptom is that I am getting reported means around 4.1e-302 when I am expecting positive values somewhere between 1e+3 and 1e+17.

The problem consistently appears when using the bit64 package and functions related to integer64. Things work for me with "regular-sized" data and plain R, but I am not expressing things correctly with this package. See my code and data below.

I am on a MacBook Pro, 16GB, i7 (updated).

I have restarted my R session and cleared the workspace, but the problem consistently remains.

Please kindly advise; I appreciate the input. I assume it has to do with using the library, bit64.

Links I looked at included the bit64 documentation.

I also looked at an issue with similar symptoms caused by an fread() memory leak, but I think I have eliminated that as the cause.

Here is my input data

var1,var2,var3,var4,var5,var6,expected_row_mean,expected_row_stddev
1000 ,993 ,987 ,1005 ,986 ,1003 ,996 ,8 
100000 ,101040 ,97901 ,100318 ,96914 ,97451 ,98937 ,1722 
10000000 ,9972997 ,9602778 ,9160554 ,8843583 ,8688500 ,9378069 ,565637 
1000000000 ,1013849241 ,973896894 ,990440721 ,1030267777 ,1032689982 ,1006857436 ,23096234 
100000000000 ,103171209097 ,103660949260 ,102360301140 ,103662297222 ,106399064194 ,103208970152 ,2078732545 
10000000000000 ,9557954451905 ,9241065464713 ,9357562691674 ,9376495364909 ,9014072235909 ,9424525034852 ,334034298683 
1000000000000000 ,985333546044881 ,994067361457872 ,1034392968759970 ,1057553099903410 ,1018695335152490 ,1015007051886440 ,27363415718203 
100000000000000000 ,98733768902499600 ,103316759127969000 ,108062824583319000 ,111332326225036000 ,108671041505404000 ,105019453390705000 ,5100048567944390 

My code, working with this sample data

# file: problem_bit64.R
# OBJECTIVE: Using larger numbers, I want to calculate a row mean and row standard deviation
# ERROR:  I don't know what I am doing wrong to get such errors, seems bit64 related
# PRIORITY: BLOCKED (do this in Python instead?)
# reported Sat 9/24/2016 by Greg

# sample data:
# each row is 100 times larger on average, for 8 rows, starting with 1,000
# for the vars within a row, there is 10% uniform random variation.  B2 = ROUND(A2+A2*0.1*(RAND()-0.5),0)    
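# A rough R equivalent of that spreadsheet formula, for reference only
# (the seed and object names below are illustrative assumptions; this is
# not how the CSV above was actually generated):
set.seed(1)
base_vals <- 1000 * 100^(0:7)                             # 1e3, 1e5, ..., 1e17
sim_vars  <- sapply(1:6, function(j)
  round(base_vals + base_vals * 0.1 * (runif(8) - 0.5)))  # +/- 5% uniform noise per var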

# Install development version of data.table --> for fwrite()
install.packages("data.table", repos = "https://Rdatatable.github.io/data.table", type = "source")
require(data.table)
require(bit64)
.Machine$integer.max   # 2147483647     Is this an issue ?
.Machine$double.xmax   # 1.797693e+308  I assume not

# -------------------------------------------------------------------
# ---- read in and basic info that works
csv_in <- "problem_bit64.csv"
dt <- fread( csv_in )
dim(dt)                # 8 8
lapply(dt, class)      # "integer64" for all 8
names(dt)  # "var1" "var2"  "var3"  "var4"  "var5" "var6" "expected_row_mean" "expected_row_stddev"
dtin <- dt[, 1:6, with=FALSE]  # just save the 6 input columns

... now the problems start

# -------------------------------------------------------------------
# ---- CALCULATION PROBLEMS START HERE
# ---- for each row, I want to calculate the mean and standard deviation
a <- apply(dtin, 1, mean.integer64); a   # get 8 values like 4.9e-321
b <- apply(dtin, 2, mean.integer64); b   # get 6 values like 8.0e-308

# ---- try secondary variations that do not work
c <- apply(dtin, 1, mean); c             # get 8 values like 4.9e-321
c <- apply(dtin, 1, mean.integer64); c   # same result
c <- apply(dtin, 1, function(x) mean(x));   c          # same
c <- apply(dtin, 1, function(x) sum(x)/length(x));  c  # same results as mean(x)

##### I don't see any sd.integer64       # FEATURE REQUEST, Z-TRANSFORM IS COMMON
c <- apply(dtin, 1, function(x) sd(x));   c          # unrealistic values - see expected

Regular-sized R on regular data, still using the data read in with fread() into a data.table - WORKS

# -------------------------------------------------------------------
# ---- delete big numbers, and try regular stuff - WHICH WORKS
dtin2 <- dtin[ 1:3, ]    # just up to about 10 million (SAME DATA, SAME FREAD, SAME DATA.TABLE)
dtin2[ , var1 := as.integer(var1) ]  # I know there are fancier ways to do this
dtin2[ , var2 := as.integer(var2) ]  # but I want things to work before getting fancy.
dtin2[ , var3 := as.integer(var3) ]
dtin2[ , var4 := as.integer(var4) ]
dtin2[ , var5 := as.integer(var5) ]
dtin2[ , var6 := as.integer(var6) ]
lapply( dtin2, class )   # validation

c <- apply(dtin2, 1, mean); c   # get 3 row values AS EXPECTED (matching expected columns)
c <- apply(dtin2, 1, function(x) mean(x));   c          # CORRECT
c <- apply(dtin2, 1, function(x) sum(x)/length(x));  c  # same results as mean(x)

c <- apply(dtin2, 1, sd); c             # get 3 row values AS EXPECTED (matching expected columns)
c <- apply(dtin2, 1, function(x) sd(x));   c          # CORRECT

Solution

  • As a short and first recommendation to most readers: do use 'double' instead of 'integer64' unless you have a specific reason to use 64-bit integers. 'double' is an R-internal datatype, while 'integer64' is a package extension datatype: it is represented as a 'double' vector with a class attribute 'integer64', i.e. each element's 64 bits are interpreted as a 64-bit integer by code that knows about this class. Unfortunately many core R functions do not know about 'integer64', which easily leads to wrong results. Hence, coerce to 'double':

    dtind <- dtin
    for (i in seq_along(dtind))
      dtind[[i]] <- as.double(dtind[[i]])
    b <- apply(dtind, 1, mean)
    

    will give the somewhat expected result

    > b
    [1] 9.956667e+02 9.893733e+04 9.378069e+06 1.006857e+09 1.032090e+11 9.424525e+12 1.015007e+15 1.050195e+17
    

    although it is not exactly what you expected, neither when looking at the rounded differences

    > b - dt$expected_row_mean
    integer64
    [1] -1   0    -1   -1   0    -1   -3   -392
    

    nor when looking at the unrounded differences

    > b - as.double(dt$expected_row_mean)
    [1]   -0.3333333    0.3333333   -0.3333333   -0.1666666    0.1666718 -0.3339844   -2.8750000 -384.0000000
    Warning message:
    In as.double.integer64(dt$expected_row_mean) :
      integer precision lost while converting to double
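    That warning is expected here: a double can represent integers exactly only up to about 2^53, and the largest expected means (around 1e+17) are beyond that. A quick illustration of the limit (not specific to this data):

    2^53 == 2^53 + 1                              # TRUE:  the double cannot tell them apart
    as.integer64(2^53) + 1 == as.integer64(2^53)  # FALSE: integer64 still can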
    

    OK, let's assume you truly want integer64 because your largest numbers are beyond the integer precision 2^53 of doubles. Then your problem starts with the fact that 'apply' does not know about integer64 and actually destroys the 'integer64' class attribute:

    > apply(dtin, 1, is.integer64)
    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
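    Once the class attribute is gone, the underlying 64-bit payloads are reinterpreted and printed as tiny doubles, which is where values like 4.9e-321 come from. A small illustration (the exact printed value may vary slightly):

    x <- as.integer64(996)
    typeof(x)   # "double"    - the storage really is a double vector
    class(x)    # "integer64" - only the class attribute says to read the bits as int64
    unclass(x)  # roughly 4.9e-321: the same 64 bits read as a plain double,
                # i.e. the kind of value apply() returned above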
    

    apply() actually destroys the 'integer64' class attribute twice, once when preparing the inputs and once when postprocessing the outputs. We can fix this by:

    c <- apply(dtin, 1, function(x){
      oldClass(x) <- "integer64"  # fix 
      mean(x) # note that this dispatches to mean.integer64
    })
    oldClass(c) <- "integer64"  # fix again
    

    Now the result looks reasonable

    > c
    integer64
    [1] 995                98937              9378068            1006857435         103208970152       9424525034851      1015007051886437   105019453390704600
    

    but still is not what you expected

    > c - dt$expected_row_mean
    integer64
    [1] -1   0    -1   -1   0    -1   -3   -400
    

    The small differences (-1) are due to rounding, since the floating-point mean is

    > b[1]
    [1] 995.6667
    

    and you expected

    > dt$expected_row_mean[1]
    integer64
    [1] 996
    

    while mean.integer64 coerces (truncates) to integer64. This behavior of mean.integer64 is debatable; however, it is at least consistent:

    x <- seq(0, 1, 0.25)
    > data.frame(x=x, y=as.integer64(0) + x)
         x y
    1 0.00 0
    2 0.25 0
    3 0.50 0
    4 0.75 0
    5 1.00 1
    > mean(as.integer64(0:1))
    integer64
    [1] 0
    

    The topic of rounding makes clear that implementing sd.integer64 would be even more debatable. Should it return integer64 or double?
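    Until something like that exists, one workaround for the row standard deviation is to restore the class inside apply() and compute sd() on an explicit double conversion (a sketch using the dtin object from above, not part of the bit64 API; values above about 2^53 lose exactness in the conversion):

    row_sd <- apply(dtin, 1, function(x) {
      oldClass(x) <- "integer64"   # restore the class that apply() stripped
      sd(as.double(x))             # sd() works on doubles and returns a double
    })
    row_sd                         # compare against dt$expected_row_stddev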

    Regarding the bigger differences, it is unclear what the rationale behind your expectation is: taking the seventh row of your table and subtracting its minimum

    x <- (unlist(dtin[7,]))
    oldClass(x) <- "integer64"
    y <- min(x)
    z <- as.double(x - y)
    

    gives numbers in a range where 'double' precisely handles integers

    > log2(z)
    [1] 43.73759     -Inf 42.98975 45.47960 46.03745 44.92326
    

    averaging those and comparing against your expectation still gives a difference not explained by rounding

    > mean(z) - as.double(dt$expected_row_mean[7] - y)
    [1] -2.832031