Search code examples
rdataframeinterpolation

How to interpolate data with R?


I have a data frame that looks like this:

z <- data.frame(ent = c(1, 1, 1, 2, 2, 2, 3, 3, 3), 
                year = c(1995, 2000, 2005, 1995, 2000, 2005, 1995, 2000, 2005), 
                pobtot = c(50, 60, 70, 10, 4, 1, 100, 105, 110))

As you can see, there is a gap between 5 years for every ent. I want to interpolate data to every missing year: 1996, 1997, 1998, 1999, 2001, 2002, 2003, 2004 and also prognosticate to 2006, 2007 and 2008. Is there a way to do this?

Any help would be appreciated.


Solution

  • Assuming you want linear interpolation, R uses approx() for such things by default, e.g. for drawing lines in a plot. We may also use that function to interpolate the years. It doesn't extrapolate, though, but we could use forecast::ets() with default settings for this which calculates an exponential smoothing state space model. Note, however, that this may also produce negative values, but OP hasn't stated what is needed in such a case. So anyway in a by() approach we could do:

    library(forecast)
    p <- 3  ## define number of years for prediction
    
    res <- do.call(rbind, by(z, z$ent, function(x) {
      yseq <- min(x$year):(max(x$year) + p)  ## sequence of years + piction
      a <- approx(x$year, x$pobtot, head(yseq, -p))$y  ## linear interpolation
      f <- predict(ets(a), 3) ## predict `p` years
      r <- c(a, f$mean)  ## combine interpolation and prediction
      data.frame(ent=x$ent[1], year=yseq, pobtot=r)  ## output as data frame
    }))
    

    Result

    res
    #      ent year pobtot
    # 1.1    1 1995   50.0
    # 1.2    1 1996   52.0
    # 1.3    1 1997   54.0
    # 1.4    1 1998   56.0
    # 1.5    1 1999   58.0
    # 1.6    1 2000   60.0
    # 1.7    1 2001   62.0
    # 1.8    1 2002   64.0
    # 1.9    1 2003   66.0
    # 1.10   1 2004   68.0
    # 1.11   1 2005   70.0
    # 1.12   1 2006   72.0
    # 1.13   1 2007   74.0
    # 1.14   1 2008   76.0
    # 2.1    2 1995   10.0
    # 2.2    2 1996    8.8
    # 2.3    2 1997    7.6
    # 2.4    2 1998    6.4
    # 2.5    2 1999    5.2
    # 2.6    2 2000    4.0
    # 2.7    2 2001    3.4
    # 2.8    2 2002    2.8
    # 2.9    2 2003    2.2
    # 2.10   2 2004    1.6
    # 2.11   2 2005    1.0
    # 2.12   2 2006    0.4
    # 2.13   2 2007   -0.2
    # 2.14   2 2008   -0.8
    # 3.1    3 1995  100.0
    # 3.2    3 1996  101.0
    # 3.3    3 1997  102.0
    # 3.4    3 1998  103.0
    # 3.5    3 1999  104.0
    # 3.6    3 2000  105.0
    # 3.7    3 2001  106.0
    # 3.8    3 2002  107.0
    # 3.9    3 2003  108.0
    # 3.10   3 2004  109.0
    # 3.11   3 2005  110.0
    # 3.12   3 2006  111.0
    # 3.13   3 2007  112.0
    # 3.14   3 2008  113.0
    

    We could quickly check this in a plot, which, apart from the negative values of entity 2 looks quite reasonable.

    with(res, plot(year, pobtot, type='n', main='z'))
    with(res[res$year < 2006, ], points(year, pobtot, pch=20, col=3))
    with(res[res$year > 2005, ], points(year, pobtot, pch=20, col=4))
    with(res[res$year %in% z$year, ], points(year, pobtot, pch=20, col=1))
    abline(h=0, lty=3)
    legend(2005.25, 50, c('measurem.', 'interpol.', 'extrapol.'), pch=20,
           col=c(1, 3, 4), cex=.8, bty='n')
    

    ![enter image description here