Search code examples
rnapanel-dataextrapolation

How to EXTRAPOLATE missing data with R in panel data?


In a Panel Data, I have some NA values that I would like to extrapolate at the end or start of my years of interest. Most solutions to similar questions deal with interpolation. Please note that this is not the case.

A similar sample of my data called "data" looks like this:

> data
   REGION YEAR  VALUE
1       A 2011     NA
2       A 2012     NA
3       A 2013     NA
4       A 2014  20.00
5       A 2015  25.00
6       A 2016  30.00
7       A 2017  35.00
8       A 2018  40.00
9       B 2011     NA
10      B 2012   0.30
11      B 2013   0.50
12      B 2014   0.70
13      B 2015   0.90
14      B 2016   0.11
15      B 2017   0.13
16      B 2018   0.15
17      C 2011 100.00
18      C 2012 101.00
19      C 2013 102.00
20      C 2014 103.00
21      C 2015 104.00
22      C 2016 105.00
23      C 2017 106.00
24      C 2018     NA

Some solutions that I found to similar questions:

I have try with na_interpolate but it just repeats the last values. I also tried with mutate combined with na.spline, but it creates values that are obviously wrong for a linear regression

I am sure that there must be a simple way to estimate these values.

The expected result should look like this:

> data
   REGION YEAR  VALUE
1       A 2011   5.00
2       A 2012  10.00
3       A 2013  15.00
4       A 2014  20.00
5       A 2015  25.00
6       A 2016  30.00
7       A 2017  35.00
8       A 2018  40.00
9       B 2011   0.10
10      B 2012   0.30
11      B 2013   0.50
12      B 2014   0.70
13      B 2015   0.90
14      B 2016   0.11
15      B 2017   0.13
16      B 2018   0.15
17      C 2011 100.00
18      C 2012 101.00
19      C 2013 102.00
20      C 2014 103.00
21      C 2015 104.00
22      C 2016 105.00
23      C 2017 106.00
24      C 2018 107.00

Thank you for your help


Solution

  • You can just do:

    predictions <- round(predict(lm(VALUE ~ REGION * YEAR, df), newdata = df), 2)
    predictions
    #>      1      2      3      4      5      6      7      8      9     10 
    #>   5.00  10.00  15.00  20.00  25.00  30.00  35.00  40.00   0.65   0.59 
    #>     11     12     13     14     15     16     17     18     19     20 
    #>   0.53   0.46   0.40   0.34   0.27   0.21 100.00 101.00 102.00 103.00 
    #>     21     22     23     24 
    #> 104.00 105.00 106.00 107.00 
    
    df$VALUE[is.na(df$VALUE)] <- predictions[is.na(df$VALUE)]
    

    Which gives the desired result:

    df
    #>    REGION YEAR  VALUE
    #> 1       A 2011   5.00
    #> 2       A 2012  10.00
    #> 3       A 2013  15.00
    #> 4       A 2014  20.00
    #> 5       A 2015  25.00
    #> 6       A 2016  30.00
    #> 7       A 2017  35.00
    #> 8       A 2018  40.00
    #> 9       B 2011   0.65
    #> 10      B 2012   0.30
    #> 11      B 2013   0.50
    #> 12      B 2014   0.70
    #> 13      B 2015   0.90
    #> 14      B 2016   0.11
    #> 15      B 2017   0.13
    #> 16      B 2018   0.15
    #> 17      C 2011 100.00
    #> 18      C 2012 101.00
    #> 19      C 2013 102.00
    #> 20      C 2014 103.00
    #> 21      C 2015 104.00
    #> 22      C 2016 105.00
    #> 23      C 2017 106.00
    #> 24      C 2018 107.00