Tags: r, data.table, reshape2, dcast

Error with large numerics in dcast.data.table


Given a data frame, I am trying to cast from long to wide using the dcast.data.table function from library(data.table). However, when large numerics are used on the left side of the formula, distinct values somehow get combined.

Below is an example:

df <- structure(list(A = c(10000000007624, 10000000007619, 10000000007745, 
10000000007624, 10000000007767, 10000000007729, 10000000007705, 
10000000007711, 10000000007784, 10000000007745, 10000000007624, 
10000000007762, 10000000007762, 10000000007631, 10000000007762, 
10000000007619, 10000000007628, 10000000007705, 10000000007762, 
10000000007624, 10000000007745, 10000000007706, 10000000007767, 
10000000007777, 10000000007624, 10000000007745, 10000000007624, 
10000000007777, 10000000007771, 10000000007631, 10000000007624, 
10000000007640, 10000000007642, 10000000007708, 10000000007711, 
10000000007745, 10000000007767, 10000000007655, 10000000007722, 
10000000007745, 10000000007762, 10000000007771, 10000000007617
), B = c(4060697L, 7683673L, 7699192L, 1322422L, 7754939L, 7448486L, 
2188027L, 1061376L, 2095950L, 7793530L, 2095950L, 6447861L, 2188027L, 
7448451L, 7428427L, 7516354L, 7067801L, 2095950L, 6740142L, 405911L, 
4057215L, 1061345L, 7754945L, 7501748L, 2188027L, 7780980L, 6651988L, 
6649330L, 6655118L, 6556367L, 6463510L, 2347462L, 7675114L, 6556361L, 
1061345L, 7224099L, 6463515L, 2188027L, 6463515L, 7311234L, 7764971L, 
7224099L, 2347479L), C = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 
3L, 3L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 25L, 2L, 1L, 2L, 
1L, 1L, 1L)), .Names = c("A", "B", "C"), row.names = c(NA, -43L
), class = "data.frame")

library(data.table)

df <- as.data.table(df)

output <- dcast.data.table(df, A ~ B, value.var = "C",
                           fun.aggregate = sum, fill = NA)

This produces only 2 rows, 10000000007624 and 10000000007784, and every value is summed into just those two.

This error does not occur with the reshape2::dcast function, which produces the correct result.

Is there a reason why dcast.data.table is producing this error?


Solution

  • The issue was raised on GitHub and answered by @jangorecki. This answer comes from the setNumericRounding help document:

    "when joining or grouping, data.table rounds such data to apx 11 s.f. which is plenty of digits for many cases. This is achieved by rounding the last 2 bytes off the significand."

    As such, my 14-digit large numerics were getting rounded and therefore combined.

    As @jangorecki mentions, this can be avoided by calling setNumericRounding(0). However, I have personally re-classified my large numerics as factors, which makes more sense for my particular use case.

    In addition, @jangorecki advises using the bit64 package when dealing with large numerics.

    The original post on GitHub.
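The three workarounds mentioned in the answer can be sketched roughly as follows. This is a minimal illustration, not the exact code from the GitHub thread: the small table dt and its values are made up for the example, and the names wide_exact, wide_fct and dt_i64 are hypothetical. Depending on the installed data.table version, the default rounding may already be 0; getNumericRounding() reports the current setting.

```r
library(data.table)

# Illustrative data (made up for this sketch): three distinct 14-digit ids
# that differ only in their last few digits.
dt <- data.table(
  A = c(10000000007624, 10000000007619, 10000000007745, 10000000007624),
  B = c(1L, 2L, 1L, 2L),
  C = c(1L, 1L, 2L, 3L)
)

# Workaround 1: switch off rounding of numeric keys for the cast,
# then restore whatever setting was active before.
old_rounding <- getNumericRounding()
setNumericRounding(0)
wide_exact <- dcast.data.table(dt, A ~ B, value.var = "C", fun.aggregate = sum)
setNumericRounding(old_rounding)

# Workaround 2: treat the ids as labels rather than numbers.
# sprintf("%.0f", ...) writes out every digit, so no two ids share a label.
dt_fct <- copy(dt)
dt_fct[, A := factor(sprintf("%.0f", A))]
wide_fct <- dcast.data.table(dt_fct, A ~ B, value.var = "C", fun.aggregate = sum)

# Workaround 3 (only if bit64 is installed): store the ids as 64-bit integers,
# which are grouped on their exact value.
if (requireNamespace("bit64", quietly = TRUE)) {
  dt_i64 <- copy(dt)
  dt_i64[, A := bit64::as.integer64(A)]
  print(dt_i64[, .N, by = A])
}

# Both casts should contain one row per distinct id.
print(nrow(wide_exact))
print(nrow(wide_fct))
```

Converting to a factor or character key tends to be the more robust choice when the column is really an identifier, since it does not depend on a session-wide setting; setNumericRounding(0) is the smaller change when the values must remain numeric.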