Search code examples
rdata.tablepredict

How to fill up the missing values in data.table?


I have two data.tables in which I have some missing values. The non-missing values make a curve and the missing values should fit into this curve.

temp_dt_1 = structure(list(x = c(33, 34, 35, 36, 37, 38, 39, 40, 45, 50, 
51.5, 52.5, 53, 54, 55, 55.5, 56, 56.5, 57, 57.5, 58, 59, 59.5, 
60, 60.5, 61, 61.5, 62, 62.5, 63, 63.5, 64, 65, 66, 66.5, 67, 
67.5), y = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, 2.33, NA, NA, 1.6, NA, NA, 0.91, NA, 0.665, 
NA, 0.465, 0.39, 0.325, 0.265, 0.24, 0.135, 0.1, 0.095, NA, 0.105
)), row.names = c(NA, -37L), class = c("data.table", "data.frame"
))

> temp_dt_1 
       x     y
 1: 33.0    NA
 2: 34.0    NA
 3: 35.0    NA
 4: 36.0    NA
 5: 37.0    NA
 6: 38.0    NA
 7: 39.0    NA
 8: 40.0    NA
 9: 45.0    NA
10: 50.0    NA
11: 51.5    NA
12: 52.5    NA
13: 53.0    NA
14: 54.0    NA
15: 55.0    NA
16: 55.5    NA
17: 56.0    NA
18: 56.5 2.330
19: 57.0    NA
20: 57.5    NA
21: 58.0 1.600
22: 59.0    NA
23: 59.5    NA
24: 60.0 0.910
25: 60.5    NA
26: 61.0 0.665
27: 61.5    NA
28: 62.0 0.465
29: 62.5 0.390
30: 63.0 0.325
31: 63.5 0.265
32: 64.0 0.240
33: 65.0 0.135
34: 66.0 0.100
35: 66.5 0.095
36: 67.0    NA
37: 67.5 0.105

The second data.table has missing values on both sides (at start and end of y) -

temp_dt_2 = structure(list(x = c(25, 27.5, 30, 32.5, 35, 40, 43, 45, 47, 
50, 52.5, 55, 57.5, 60, 65, 85, 90), y = c(NA, NA, 1.1, NA, 
1.765, 2.64, 3.42, 3.975, NA, 5.625, 6.95, 7.825, 8.95, 10.375, 
13.725, NA, NA)), row.names = c(NA, -17L), class = c("data.table", 
"data.frame"))

In the plot below the black line shows the known points (which are present in data.table) and the red line is the approximate expected output for temp_dt_1

library(ggplot2)
ggplot(temp_dt_1, aes(x, y)) + geom_line()

enter image description here

I am looking for only data.table solution. I will really appreciate it if the code has good performance.


Solution

  • You can use approxfun, splinefun, loess, ... to fill up missing values.

    res <- cbind(a = approxfun(temp_dt_1$x, temp_dt_1$y, rule=2)(temp_dt_1$x),
    s = splinefun(temp_dt_1$x, temp_dt_1$y)(temp_dt_1$x), 
    l = predict(loess(y ~ x, temp_dt_1, control = loess.control(surface = "direct")), temp_dt_1))
    
    plot(temp_dt_1$x, temp_dt_1$y, ylim=range(res))
    for(i in 1:3) lines(temp_dt_1$x, res[,i], col=i, lwd=2)
    legend("topright", c("approxfun", "spline", "loess"), lty=1, lwd=2, col=1:3)
    

    Plot of interpolated points