Search code examples
rbigdatalinear-interpolation

Bigdata linear interpolation in R


I've a big data set with some NA values. The sample data is below.

Data <-   data.frame(col_1 = c('A','A','A','A', 'A', 'A', 'A', 'B', 'B', 'B'), col_2 = c('C','C', 'C', 'D', 'D','D', 'D', 'E', 'E', 'E'), col_3 = c(10,15,20, 10,20,25,30,5,10,15), value = c(0.9, NA, 0.6, 0.9, NA, NA,0.4, 0.8,NA,0.4))

I want to fill those NAs with linear interpolation. For example to fill the NA for col_1 = ‘A’ and col_2 = ‘C’

value = 0.9 + (0.6-0.9)*(15-10)/(20-10) =  0.75

And for the second NA for col_1 = ‘A’ and col_2 = ‘D’

value = 0.9 + (0.4-0.9)*(25-10)/(30-10) =  0.53

Is there an efficient way of doing it since my data is big? Thank you. The expected outcome is.

Data_Updated <- data.frame(col_1 = c('A','A','A','A', 'A', 'A', 'A', 'B', 'B', 'B'), col_2 = c('C','C', 'C', 'D', 'D','D', 'D', 'E', 'E', 'E'), col_3 = c(10,15,20, 10,20,25,30,5,10,15), value = c(0.9, 0.75, 0.6, 0.9, 0.65, 0.53,0.4, 0.8,0.6,0.4))

Solution

  • Try if this is fast enough:

    library(data.table)
    library(zoo)
    
    setDT(Data)
    Data[, value1 := na.approx(value, x = col_3), by = .(col_1, col_2)]
    #    col_1 col_2 col_3 value value1
    # 1:     A     C    10   0.9  0.900
    # 2:     A     C    15    NA  0.750
    # 3:     A     C    20   0.6  0.600
    # 4:     A     D    10   0.9  0.900
    # 5:     A     D    20    NA  0.650
    # 6:     A     D    25    NA  0.525
    # 7:     A     D    30   0.4  0.400
    # 8:     B     E     5   0.8  0.800
    # 9:     B     E    10    NA  0.600
    #10:     B     E    15   0.4  0.400