data.table size and datatable.alloccol option


The dataset I am working on is not very big, but it is quite wide. It currently has 10,854 columns and I would like to add approximately another 10-11k columns. It has only 760 rows.

When I try to do so (applying functions to a subset of the existing columns), I get the following warning:

Warning message:
In `[.data.table`(setDT(Final), , `:=`(c(paste0(vars, ".xy_diff"),  :
  truelength (30854) is greater than 10,000 items over-allocated (length = 10854). See ?truelength. If you didn't set the datatable.alloccol option very large, please report to data.table issue tracker including the result of sessionInfo().
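
Checking the allocation confirms the numbers in the warning (as I understand it, length() reports the columns in use and truelength() the allocated column-pointer slots):

length(Final)      # 10854 columns in use
truelength(Final)  # 30854 allocated column slots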

I have tried to play with setalloccol, but I get something similar. For example:

setalloccol(Final, 40960)
Error in `[.data.table`(x, i, , ) : 
  getOption('datatable.alloccol') should be a number, by default 1024. But its type is 'language'.
In addition: Warning message:
In setalloccol(Final, 40960) :
  tl (51894) is greater than 10,000 items over-allocated (l = 21174). If you didn't set the datatable.alloccol option to be very large, please report to data.table issue tracker including the result of sessionInfo().
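
From the error, my guess is that the datatable.alloccol option somehow ended up holding an unevaluated expression instead of a number. As far as I understand, the intended usage is to set the option to a plain number and then re-allocate (a rough sketch, the value is only illustrative):

options(datatable.alloccol = 40960L)  # a plain number, not an expression
setalloccol(Final)                    # re-allocate using the option's value
truelength(Final)                     # check the new over-allocation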

Is there a way to bypass this problem?

Thanks a lot

Edit:

To answer Roland's comment, here is what I am doing:

library(data.table)

vars <- c(colnames(FinalTable_0)[271:290],
          colnames(FinalTable_0)[292:ncol(FinalTable_0)]) # <- variables I want to operate on
# FinalTable_0 is a previous table I use to collect the roots of the variables I want to work with
# build, for each root, the expressions "root.x - root.y" and "root.x / root.y" as strings
difference <- function(root) lapply(root, function(z) paste0("get('", z, ".x') - get('", z, ".y')"))
ratio      <- function(root) lapply(root, function(z) paste0("get('", z, ".x') / get('", z, ".y')"))
# proceed to the computation: parse and evaluate the expressions inside the data.table
setDT(Final)[ , c(paste0(vars, ".xy_diff"), paste0(vars, ".xy_ratio")) :=
                lapply(c(difference(vars), ratio(vars)), function(x) eval(parse(text = x)))]

Solution

  • I tried the solution proposed by Roland, but was not fully satisfied. It works, but I do not like the idea of transposing my data.

    In the end, I just split the original data.table into several smaller ones, did the computations on each individually, and joined them back together at the end. Fast and simple: no need to play with variables, tell which ones are ids and which are measures, no need to shape and reshape. I just prefer it this way; a rough sketch of what I mean is below.
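
    Roughly, it looks like the sketch below (hypothetical: it assumes the join key is a column called "id" and uses a block size of ~2000 roots, but any chunking works):

    library(data.table)

    # split the roots into blocks, compute on each block separately,
    # then join the pieces back together on the (assumed) "id" column
    blocks <- split(vars, ceiling(seq_along(vars) / 2000))

    pieces <- lapply(blocks, function(b) {
      cols  <- c("id", paste0(b, ".x"), paste0(b, ".y"))
      piece <- Final[, ..cols]
      piece[, paste0(b, ".xy_diff")  := lapply(b, function(z) get(paste0(z, ".x")) - get(paste0(z, ".y")))]
      piece[, paste0(b, ".xy_ratio") := lapply(b, function(z) get(paste0(z, ".x")) / get(paste0(z, ".y")))]
      piece[, c("id", paste0(b, ".xy_diff"), paste0(b, ".xy_ratio")), with = FALSE]
    })

    # join the new columns back onto the original table, one piece at a time
    out <- Reduce(function(x, y) x[y, on = "id"], pieces, Final)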