Search code examples
rtime-seriesmissing-dataimputationimputets

Time series Imputation based on ID


I am working on a time series data. The dataset is:

datALL <- read.table(header=TRUE, text="
                   ID  Year Align
                   A01    2017  329
                   A01    2016  NA
                   A01    2015  NA
                   A01    2014  314
                   A01    2013  NA
                   A01    2012  NA
                   A01    2011  432
                   A02    2017  4536
                   A02    2016  NA
                   A02    2015  NA
                   A02    2014  2345
                   A02    2013  NA
                   A02    2012  NA
                   A02    2011  1932
                   ")
datALL
    ID Year Align
1  A01 2017   329
2  A01 2016    NA
3  A01 2015    NA
4  A01 2014   314
5  A01 2013    NA
6  A01 2012    NA
7  A01 2011   432
8  A02 2017  4536
9  A02 2016    NA
10 A02 2015    NA
11 A02 2014  2345
12 A02 2013    NA
13 A02 2012   NA
14 A02 2011  1932

I want to impute missing value using imputeTS package. The package works fine for individual ID.

datA01 <- read.table(header=TRUE, text="
                   ID  Year Align
                   A01    2017  329
                   A01    2016  NA
                   A01    2015  NA
                   A01    2014  314
                   A01    2013  NA
                   A01    2012  NA
                   A01    2011  432
                   ")
datA01
   ID Year Align
1 A01 2017   329
2 A01 2016    NA
3 A01 2015    NA
4 A01 2014   314
5 A01 2013    NA
6 A01 2012    NA
7 A01 2011   432


### install.packages("imputeTS")
library(imputeTS)
datA01$Year <- ts(datA01[, c(2)])
datA01$Align1 <- na_kalman(datA01$Align)
dat1
   ID Year Align   Align1
1 A01 2017   329 329.0000
2 A01 2016    NA 318.9847
3 A01 2015    NA 312.7852
4 A01 2014   314 314.0000
5 A01 2013    NA 347.2150
6 A01 2012    NA 387.7720
7 A01 2011   432 432.0000

For A02 it also works perfectly:

datA02 <- read.table(header=TRUE, text="
                   ID  Year Align
                   A02    2017  4536
                   A02    2016  NA
                   A02    2015  NA
                   A02    2014  2345
                   A02    2013  NA
                   A02    2012  NA
                   A02    2011  1932
                   ")
datA02$Year <- ts(datA02[, c(2)])
datA02$Align1 <- na_kalman(datA02$Align)
datA02 
   ID Year Align   Align1
1 A02 2017  4536 4536.000
2 A02 2016    NA 3510.613
3 A02 2015    NA 3168.817
4 A02 2014  2345 2345.000
5 A02 2013    NA 2485.226
6 A02 2012    NA 2143.431
7 A02 2011  1932 1932.000

For all data together, it will not work because it takes all 14 years as a continuous time-series. It should take each seven-year based on the ID. I need help in getting a loop function which can take care of that.

datALL$Year <- ts(datALL[, c(2)])
datALL$Align1 <- na_kalman(datALL$Align)
#### WRONG IMPUTATION DUE TO FAILUE IN SEPARATING YEARS BY ID
datALL
    ID Year Align    Align1
1  A01 2017   329  329.0000
2  A01 2016    NA  808.8287
3  A01 2015    NA  968.7716
4  A01 2014   314  314.0000
5  A01 2013    NA 1288.6573
6  A01 2012    NA 1448.6002
7  A01 2011   432  432.0000
8  A02 2017  4536 4536.0000
9  A02 2016    NA 1928.4289
10 A02 2015    NA 2088.3718
11 A02 2014  2345 2345.0000
12 A02 2013    NA 2408.2575
13 A02 2012    NA 2568.2004
14 A02 2017  1932 1932.0000

CORRECT DATA WOULD BE LIKE THIS

   ID Year Align   Align1
1 A01 2017   329 329.0000
2 A01 2016    NA 318.9847
3 A01 2015    NA 312.7852
4 A01 2014   314 314.0000
5 A01 2013    NA 347.2150
6 A01 2012    NA 387.7720
7 A01 2011   432 432.0000
8 A02 2017  4536 4536.000
9 A02 2016    NA 3510.613
10 A02 2015    NA 3168.817
11 A02 2014  2345 2345.000
12 A02 2013    NA 2485.226
13 A02 2012    NA 2143.431
14 A02 2011  1932 1932.000

Solution

  • dplyr is your friend:

    library(imputeTS)
    library(dplyr)
    
    datALL %>%
      group_by(ID) %>%
      mutate(Align1 = na_kalman(Align))
    

    Result:

        ID Year Align    Align1
    1  A01 2017   329  329.0000
    2  A01 2016    NA  318.9847
    3  A01 2015    NA  312.7852
    4  A01 2014   314  314.0000
    5  A01 2013    NA  347.2150
    6  A01 2012    NA  387.7720
    7  A01 2011   432  432.0000
    8  A02 2017  4536 4536.0000
    9  A02 2016    NA 3510.6131
    10 A02 2015    NA 3168.8175
    11 A02 2014  2345 2345.0000
    12 A02 2013    NA 2485.2262
    13 A02 2012    NA 2143.4306
    14 A02 2011  1932 1932.0000
    

    Note that if we turn year to ts in the mutate, we get the following warning message because we are attempting to combine two different time series together:

    Warning messages: 1: In mutate_impl(.data, dots) : Vectorizing 'ts' elements may not preserve their attributes 2: In mutate_impl(.data, dots) : Vectorizing 'ts' elements may not preserve their attributes

    Turning year to a ts is actually not necessary (and incorrect) since na_kalman takes numeric vectors as input and the Align variables are the ones that take the values of the time series.