
data.table not returning the correct splinefun by group


We have recently updated our data.table from version 1.12.0 to 1.12.8 and R from 3.5.3 to 3.6.3. The example is on Windows OS.

We have a data.table where we loop over a Category column and create a splinefun object for each level to use later on. We store these splinefun outputs in a list, within a data.table column. This worked as expected on our old setup, producing a unique splinefun for each category level based on the segmented data. Now, however, it looks like it's just keeping the values for the final category and passing them to all the entries.

Setup Data

Create some fake data to show the issue.

# R version: 3.6.3 (2020-02-29)
library(data.table) # data.table_1.12.8
library(ggplot2)
library(stats) 

# mimic our data in simpler format
set.seed(1)
dt <- data.table(cat = rep(letters[1:3], each = 10),
                 x = 1:10)
dt[, y := x^0.5 * rnorm(.N, mean=runif(1, 1, 100), sd=runif(1, 1, 10)), by=cat]

# can see that each line is different
pl0 <- ggplot(data=dt, aes(x=x, y=y, col=cat)) + geom_line()
pl0

Fit Splines

Fit the splines via our current method and using lapply for comparison. lapply works as expected, data.table doesn't.

# fit spline, segment the data by category
mod_splines <- dt[, .(Spline = list(splinefun(x=x, y=y, method = "natural"))),
                  by = c("cat")]

# splinefun works such that you provide new values of x and it gives an output
# y from a spline fitted to y~x
# Can see they are all the same, which seems unlikely
mod_splines$Spline[[1]](5)
mod_splines$Spline[[2]](5)
mod_splines$Spline[[3]](5)

# alternative approach
alt_splines <-  lapply(unique(dt$cat), function(x_cat){
  splinefun(x=dt[cat==x_cat, ]$x, 
            y=dt[cat==x_cat, ]$y, 
            method = "natural")
})

# looks more realistic
alt_splines[[1]](5)
alt_splines[[2]](5)
alt_splines[[3]](5) # Matches the mod_splines one!

Checking whether splinefun is fitting

The data and outputs of the splinefun look correct when we print out from within the data.table loop, but it doesn't get stored correctly.

# check the data is segmenting
mod_splines2 <- dt[, .(Spline = list(splinefun(x=x, y=y, method = "natural")),
                      x=x, y=y),
                  by = c("cat")]
mod_splines2[] # the data is definitely segmenting ok

# try catching and printing the data
splinefun_withmorefun <- function(x, y){

  writeLines(paste(x, collapse =", "))
  writeLines(paste(round(y, 0), collapse =", "))

  foo <- splinefun(x=x, 
            y=y, 
            method = "natural")
  writeLines(paste(foo(5), collapse =", "))
  writeLines("")
  return(foo)
}

# looks like it's ok inside the function, as it prints out different results
mod_splines3 <- dt[, .(Spline = list(splinefun_withmorefun(x=x, y=y))),
                   by = c("cat")]

# but not coming through in to the listed function
mod_splines3$Spline[[1]](5)
mod_splines3$Spline[[2]](5)
mod_splines3$Spline[[3]](5)

Any ideas why this would be an issue after updates would be great! We're worried there may be other cases using a similar data.table methodology that could now be silently broken as this one was.

Thank you, Jonny


Solution

  • As I've answered in https://github.com/Rdatatable/data.table/issues/4298#issuecomment-597737776 , adding copy() on the x and y variables solves this issue.

    The reason is that splinefun() stores the values of x and y it is given. However, data.table always passes its internal per-group objects by reference (for speed), so the stored values are later overwritten as data.table moves on to the next group. In this case, you have to copy() the variables explicitly to get the expected answers.
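The by-reference behaviour can be illustrated outside of grouping with a minimal sketch. The table `toy` and the names `alias` and `snapshot` are hypothetical and only serve the illustration. It assumes that `toy$a` hands back the column without copying it, which is the usual behaviour but is not something the original post verifies.

```r
library(data.table)

# Hypothetical toy table, not the data from the question
toy <- data.table(a = c(1, 2, 3))

alias    <- toy$a        # assumed to share memory with the column (no copy)
snapshot <- copy(toy$a)  # explicit deep copy, independent of the table

# Overwrite one cell by reference, loosely mimicking how grouping
# reuses the same per-group buffer for each successive group
set(toy, i = 2L, j = "a", value = 10)

print(toy$a)     # the column after the in-place update
print(alias)     # compare with snapshot to see whether the update leaked through
print(snapshot)  # holds the values as they were before set()
```

A closure returned by splinefun() is in the position of `alias` here unless it is given a copy, which is why wrapping the inputs in copy() helps.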

    In short, changing

    mod_splines <- dt[, .(Spline = list(splinefun(x=x, y=y, method = "natural"))),
                      by = c("cat")]
    

    to

    mod_splines <- dt[, .(Spline = list(splinefun(x=copy(x), y=copy(y), method = "natural"))),
                      by = c("cat")]
    

    or to this (you can ignore it, but it may give you a better understanding, since x+0 and y+0 also create new vectors)

    mod_splines <- dt[, .(Spline = list(splinefun(x=x+0, y=y+0, method = "natural"))),
                      by = cat]
    

    is enough.
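One way to gain confidence in the fix is to compare the copy() version against splines fitted on explicitly subset data, as in the lapply approach from the question. The sketch below reuses the question's fake data. The names `fixed`, `ref`, `fixed_at_5` and `ref_at_5` are made up for this check.

```r
library(data.table)

# Same fake data as in the question
set.seed(1)
dt <- data.table(cat = rep(letters[1:3], each = 10),
                 x = 1:10)
dt[, y := x^0.5 * rnorm(.N, mean = runif(1, 1, 100), sd = runif(1, 1, 10)), by = cat]

# Grouped fit with explicit copies of the per-group vectors
fixed <- dt[, .(Spline = list(splinefun(x = copy(x), y = copy(y), method = "natural"))),
            by = cat]

# Reference fit on explicitly subset data, outside the grouped j expression
ref <- lapply(fixed$cat, function(k) {
  d <- dt[cat == k]
  splinefun(x = d$x, y = d$y, method = "natural")
})

# Evaluate both sets of splines at x = 5 and compare
fixed_at_5 <- vapply(fixed$Spline, function(f) f(5), numeric(1))
ref_at_5   <- vapply(ref, function(f) f(5), numeric(1))

print(all.equal(fixed_at_5, ref_at_5))
print(length(unique(fixed_at_5)))  # number of distinct values across groups
```

The same comparison could be run on other grouped calls that store closures or model objects, to look for cases that share the problem.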