Search code examples
rdata.tableapplyspline

How to replace NA values in a data.table with na.spline


I'm trying to prepare some demographic data retrieved from Eurostat for further processing, amongst others replacing any missing data with corresponding approximated ones.

First I was using data.frames only, but then I got convinced that data.tables might offer some advantages over regular data.frames, so I migrated to data.tables.

One thing I've observed while doing so was getting different results when using "na.spline" in combination with "apply" versus "na.spline" as part of the data.table.

#1 source data

(dt0 <- data.table(
            "age,sex,geo\\time" = c("TOTAL,F,AD", "TOTAL,F,AL", "TOTAL,F,AM", "TOTAL,F,AT", "TOTAL,F,AZ"),
            "2014" = c(NA,    NA,      NA,      4351253, NA),
            "2013" = c(37408, NA,      NA,      4328238, 4707690),
            "2012" = c(38252, NA,      1684000, 4309977, 4651601),
            "2011" = c(38252, 1409931, 1679066, 4296293, 4594023),
            "2010" = c(40296, NA,      1673656, 4285442, 4542083)
        ))

Which generates

       age,sex,geo\\time    2014    2013    2012    2011    2010
    1:        TOTAL,F,AD      NA   37408   38252   38252   40296
    2:        TOTAL,F,AL      NA      NA      NA 1409931      NA
    3:        TOTAL,F,AM      NA      NA 1684000 1679066 1673656
    4:        TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
    5:        TOTAL,F,AZ      NA 4707690 4651601 4594023 4542083

Split into "row headings"...

(dt0a <- dt0[, 1, with=FALSE])
(cn0a <- colnames(dt0a))

...and population data...

(dt0b <- dt0[, 2:ncol(dt0), with=FALSE])
(cn0ba <- colnames(dt0b))

#2 replace NAs by "na.spline" & "apply"

(dt1b <- data.table(t(apply(dt0b, 1, na.spline, na.rm=FALSE))))
(setnames(dt1b, cn0b))
(dt1 <- cbind(dt0a, dt1b))

Which results in...

       age,sex,geo\\time    2014    2013    2012    2011    2010
    1:        TOTAL,F,AD   32832   37408   38252   38252   40296
    2:        TOTAL,F,AL 1409931 1409931 1409931 1409931 1409931
    3:        TOTAL,F,AM 1692440 1688458 1684000 1679066 1673656
    4:        TOTAL,F,AT 4351253 4328238 4309977 4296293 4285442
    5:        TOTAL,F,AZ 4755163 4707690 4651601 4594023 4542083

#3 replace NAs within "data.table"

(dt2b <- dt0b[,lapply(.SD, na.spline, na.rm=FALSE)])
(dt2 <- cbind(dt0a, dt2b))

Et voila...

       age,sex,geo\\time    2014    2013      2012    2011      2010
    1:        TOTAL,F,AD 4351253   37408   38252.0   38252   40296.0
    2:        TOTAL,F,AL 4351253 1993097 -611513.8 1409931 -629363.2
    3:        TOTAL,F,AM 4351253 3423374 1684000.0 1679066 1673656.0
    4:        TOTAL,F,AT 4351253 4328238 4309977.0 4296293 4285442.0
    5:        TOTAL,F,AZ 4351253 4707690 4651601.0 4594023 4542083.0

#4 compare results

(identical(dt1, dt2))

Considering the aforementioned not quite a surprise...

    [1] FALSE

(The values calculated as a replacement for NAs with approach #2 are the ones I'd be interested in, only generated via approach #3).

Reason for pursuing the "data.table" route (approach #3) is one of performance (as it has been pointed out in various posts, when using "apply" a matrix operation is getting carried out, which is taking a considerably longer time than a corresponding approach facilitating "data.table" only).

Being pretty new to R I reckon I've done something quite fundamentally wrong, only thing is, I haven't got the faintest idea what this might be.

Any help pointing me in the right direction is more than appreciated!

-Sil68


Solution

  • Using a matrix. Using a matrix operation on a matrix is not slow:

    mat           <- t(as.matrix(dt0[,-1]))
    colnames(mat) <- dt0[[1]]
    mat[]         <- na.spline(mat,na.rm=FALSE)
    

    which gives

         TOTAL,F,AD TOTAL,F,AL TOTAL,F,AM TOTAL,F,AT TOTAL,F,AZ
    2014      32832    1409931    1692440    4351253    4755163
    2013      37408    1409931    1688458    4328238    4707690
    2012      38252    1409931    1684000    4309977    4651601
    2011      38252    1409931    1679066    4296293    4594023
    2010      40296    1409931    1673656    4285442    4542083
    

    Using a data.table. If you instead want to use a data.table, do

    mat           <- t(as.matrix(dt0[,-1]))
    colnames(mat) <- dt0[[1]]
    DT            <- data.table(mat,keep.rownames=TRUE)
    DT[,(vn):=lapply(.SD,na.spline,na.rm=FALSE),.SDcols=vn]
    

    which updates DT by reference, giving

         rn TOTAL,F,AD TOTAL,F,AL TOTAL,F,AM TOTAL,F,AT TOTAL,F,AZ
    1: 2014      32832    1409931    1692440    4351253    4755163
    2: 2013      37408    1409931    1688458    4328238    4707690
    3: 2012      38252    1409931    1684000    4309977    4651601
    4: 2011      38252    1409931    1679066    4296293    4594023
    5: 2010      40296    1409931    1673656    4285442    4542083
    

    Benchmarking:

    mat           <- t(as.matrix(dt0[,-1]))
    colnames(mat) <- dt0[[1]]
    DT            <- data.table(mat,keep.rownames=TRUE)
    vn            <- names(DT)[-1]
    tvn           <- names(dt0)[-1]
    
    require(microbenchmark)
    microbenchmark(
      transp = dt0[,as.list(na.spline(unlist(.SD), na.rm=FALSE)),by=1:nrow(dt0),.SDcols=tvn],
      lapply = DT[,lapply(.SD,na.spline,na.rm=FALSE),.SDcols=vn],
      apply  = apply(mat,2,na.spline,na.rm=FALSE),
      fun    = na.spline(mat,na.rm=FALSE),
    times=10)
    

    Results:

    Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
     transp 4.666934 4.734891 4.850268 4.787690 4.897202 5.259957    10
     lapply 3.923823 4.010356 4.327646 4.039445 4.049957 6.976446    10
      apply 2.505556 2.525601 2.578890 2.585978 2.592090 2.758801    10
        fun 1.945290 1.994178 2.063063 2.068490 2.085112 2.272846    10
    

    The "transp" result shows the timing of @shadow's solution, which preserves the OP's formatting. apply is not necessary here, thanks to how na.spline works.