We have recently updated data.table from version 1.12.0 to 1.12.8 and R from 3.5.3 to 3.6.3. The example is on Windows.
We have a data.table where we loop over a Category column and create a splinefun object for each level to use later on. We store these splinefun outputs in a list, within a data.table column. It worked as expected on our old specs, producing a splinefun unique to each category level based on the segmented data. However, it now looks like it's just keeping the value for the final category and copying it into all the entries.
Setup Data
Create some fake data to show the issue.
# R version: 3.6.3 (2020-02-29)
library(data.table) # data.table_1.12.8
library(ggplot2)
library(stats)
# mimic our data in simpler format
set.seed(1)
dt <- data.table(cat = rep(letters[1:3], each = 10),
x = 1:10)
dt[, y := x^0.5 * rnorm(.N, mean=runif(1, 1, 100), sd=runif(1, 1, 10)), by=cat]
# can see that each line is different
pl0 <- ggplot(data=dt, aes(x=x, y=y, col=cat)) + geom_line()
pl0
Fit Splines
Fit the splines via our current method and using lapply for comparison. lapply works as expected; data.table doesn't.
# fit spline, segment the data by category
mod_splines <- dt[, .(Spline = list(splinefun(x=x, y=y, method = "natural"))),
by = c("cat")]
# splinefun works such that you provide new values of x and it gives an output
# y from a spline fitted to y~x
# Can see they are all the same, which seems unlikely
mod_splines$Spline[[1]](5)
mod_splines$Spline[[2]](5)
mod_splines$Spline[[3]](5)
# alternative approach
alt_splines <- lapply(unique(dt$cat), function(x_cat){
  splinefun(x=dt[cat==x_cat, ]$x,
            y=dt[cat==x_cat, ]$y,
            method = "natural")
})
# looks more realistic
alt_splines[[1]](5)
alt_splines[[2]](5)
alt_splines[[3]](5) # Matches the mod_splines one!
Checking whether splinefun is fitting
The data and outputs of splinefun look correct when we print them from within the data.table loop, but the result doesn't get stored correctly.
# check the data is segmenting
mod_splines2 <- dt[, .(Spline = list(splinefun(x=x, y=y, method = "natural")),
x=x, y=y),
by = c("cat")]
mod_splines2[] # the data is definitely segmenting ok
# try catching and printing the data
splinefun_withmorefun <- function(x, y){
  # print the inputs for this group
  writeLines(paste(x, collapse =", "))
  writeLines(paste(round(y, 0), collapse =", "))
  foo <- splinefun(x=x,
                   y=y,
                   method = "natural")
  # print the fitted value at x = 5 for this group
  writeLines(paste(foo(5), collapse =", "))
  writeLines("")
  return(foo)
}
# looks like it's ok inside the function, as it prints out different results
mod_splines3 <- dt[, .(Spline = list(splinefun_withmorefun(x=x, y=y))),
by = c("cat")]
# but the differences don't come through in the stored functions
mod_splines3$Spline[[1]](5)
mod_splines3$Spline[[2]](5)
mod_splines3$Spline[[3]](5)
Any ideas why this would become an issue after the updates would be great! We're worried there may be other cases using a similar data.table methodology that could now be silently broken, as this one was.
Thank you, Jonny
As I answered in https://github.com/Rdatatable/data.table/issues/4298#issuecomment-597737776 , wrapping the x and y variables in copy() solves this issue.
The reason is that splinefun() stores the values of x and y it is given inside the function it returns. However, the internal objects of data.table are always passed by reference (for speed) rather than copied, which is consistent with every stored spline ending up with the final category's data. In this case, you have to copy() the variables explicitly to get the expected answers.
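To see what "store the values" means here, the sketch below fits a spline on a small toy vector (illustrative values only, not the question's data) and looks inside the returned function. It relies on an assumption: that the fitted data lives in an object named z in the closure's environment, which is an internal detail of stats::splinefun rather than documented API and may differ between R versions.

```r
# toy data, purely illustrative
x <- 1:10
y <- sqrt(x)
f <- splinefun(x = x, y = y, method = "natural")

# the returned function is a closure; list what its environment holds
ls(environment(f))

# ASSUMPTION: `z` is the internal name splinefun uses for the fitted
# spline (knots x, values y and coefficients); not part of the public API
str(environment(f)$z)
```

If those stored vectors refer to memory that something else later overwrites, the spline's results change with it, which is why an explicit copy helps.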
In short, changing
mod_splines <- dt[, .(Spline = list(splinefun(x=x, y=y, method = "natural"))),
by = c("cat")]
to
mod_splines <- dt[, .(Spline = list(splinefun(x=copy(x), y=copy(y), method = "natural"))),
by = c("cat")]
or to this (not needed in practice, but it may give you a better understanding, since x+0 and y+0 also produce new vectors)
mod_splines <- dt[, .(Spline = list(splinefun(x=x+0, y=y+0, method = "natural"))),
by = cat]
is enough.
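A quick way to check the fix on your own setup is to compare the copy() version against the lapply approach from the question. This is only a sketch using the question's fake data; the names fixed_splines, ref_splines, fixed_at_5 and ref_at_5 are made up for illustration.

```r
library(data.table)

# same fake data as in the question
set.seed(1)
dt <- data.table(cat = rep(letters[1:3], each = 10), x = 1:10)
dt[, y := x^0.5 * rnorm(.N, mean = runif(1, 1, 100), sd = runif(1, 1, 10)), by = cat]

# fixed version: copy() gives each spline its own x and y
fixed_splines <- dt[, .(Spline = list(splinefun(x = copy(x), y = copy(y), method = "natural"))),
                    by = cat]

# reference: one spline per category, fitted outside of data.table grouping
ref_splines <- lapply(unique(dt$cat), function(x_cat) {
  splinefun(x = dt[cat == x_cat]$x, y = dt[cat == x_cat]$y, method = "natural")
})

# evaluate every spline at x = 5 and compare the two approaches
fixed_at_5 <- vapply(fixed_splines$Spline, function(f) f(5), numeric(1))
ref_at_5   <- vapply(ref_splines, function(f) f(5), numeric(1))
all.equal(fixed_at_5, ref_at_5)
```

If the two vectors agree and the three values differ from each other, each category has kept its own spline. The same comparison could be run on other grouped computations that store functions or other objects holding on to their inputs.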