
Estimate future growth using sample of historical data


I have historical records of the size of our database over the past couple of years, and I am trying to figure out the best way (or graph) to project its future growth from those records. Of course, the projection won't account for new tables that would add growth of their own, but I am just looking for a rough estimate. I am open to ideas in Python or R.

Here is the size of the database in TB by year:

3.895 - 2012
6.863 - 2013
8.997 - 2014
10.626 - 2015


Solution

  • ## database size in TB (y) by year (x), from the question
    d <- data.frame(x = 2012:2015,
                    y = c(3.895, 6.863, 8.997, 10.626))
    

    You can visualize the fits and their projections. Here I'm comparing a quadratic polynomial (blue) with an additive model (red), both extrapolated to 2018. I'm not sure I believe the confidence intervals on the additive model, though:

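    library("ggplot2"); theme_set(theme_bw())
    ggplot(d, aes(x, y)) + geom_point() +
        ## extend the x-axis so both fits are projected to 2018
        expand_limits(x = 2018) +
        ## quadratic polynomial fit (blue band)
        geom_smooth(method = "lm", formula = y ~ poly(x, 2),
                    fullrange = TRUE, fill = "blue") +
        ## additive model fit (red); method = "gam" needs the mgcv package
        geom_smooth(method = "gam", formula = y ~ s(x, k = 3), colour = "red",
                    fullrange = TRUE, fill = "red")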
    


    I'm a little shocked that the quadratic fit is so close, although with four points and three coefficients only one residual degree of freedom is left.
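
One rough way to see why a quadratic can fit this series so closely: for equally spaced years, an exact quadratic has constant second differences. The following is a minimal Python sketch (the question allows Python or R) that computes the first and second differences of the four values from the question. The helper name `diffs` is my own, not from the original answer.

```python
# Database size in TB for 2012..2015, copied from the question.
size_tb = [3.895, 6.863, 8.997, 10.626]


def diffs(values):
    """Differences between consecutive values, rounded to 3 decimals."""
    return [round(b - a, 3) for a, b in zip(values, values[1:])]


first = diffs(size_tb)   # year-over-year growth in TB
second = diffs(first)    # change in that growth from one year to the next

print(first)
print(second)
```

The yearly increments shrink (2.968, 2.134, 1.629 TB) and both second differences are negative (-0.834 and -0.505), so growth is decelerating. That is the shape a downward-opening parabola captures, and it is also why the quadratic projection eventually turns over.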

    summary(m1 <- lm(y~poly(x,2),data=d))
    ## Residual standard error: 0.07357 on 1 degrees of freedom
    ## Multiple R-squared:  0.9998, Adjusted R-squared:  0.9994 
    ## F-statistic:  2344 on 2 and 1 DF,  p-value: 0.0146
    

    Predict the next three years (with confidence intervals for the fitted mean):

    predict(m1,newdata=data.frame(x=2016:2018),interval="confidence")
    ##        fit      lwr      upr
    ## 1 11.50325 8.901008 14.10549
    ## 2 11.72745 6.361774 17.09313
    ## 3 11.28215 2.192911 20.37139
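
Since the question also allows Python, here is a sketch that cross-checks the R fit and predictions above using only the standard library. It fits the same quadratic by solving the least-squares normal equations directly. The helper names (`solve3`, `predict`) and the choice to center the years are my own assumptions, not part of the original answer, and the sketch gives point forecasts only, without the confidence intervals.

```python
years = [2012, 2013, 2014, 2015]
size_tb = [3.895, 6.863, 8.997, 10.626]

# Center the years so the normal equations stay well conditioned.
center = sum(years) / len(years)
t = [yr - center for yr in years]


def solve3(a, b):
    """Solve a 3x3 system a @ x = b by Gaussian elimination with partial pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


# Normal equations (X^T X) beta = X^T y for the design columns 1, t, t^2.
power_sums = [sum(ti ** k for ti in t) for k in range(5)]
xtx = [[power_sums[i + j] for j in range(3)] for i in range(3)]
xty = [sum((ti ** i) * yi for ti, yi in zip(t, size_tb)) for i in range(3)]
b0, b1, b2 = solve3(xtx, xty)


def predict(year):
    """Evaluate the fitted quadratic at a calendar year."""
    u = year - center
    return b0 + b1 * u + b2 * u * u


# Goodness of fit on the four observed points.
fitted = [predict(yr) for yr in years]
mean_y = sum(size_tb) / len(size_tb)
ss_res = sum((y - f) ** 2 for y, f in zip(size_tb, fitted))
ss_tot = sum((y - mean_y) ** 2 for y in size_tb)
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 4))
for yr in (2016, 2017, 2018):
    print(yr, round(predict(yr), 5))
```

Run as written, this should print an R-squared of 0.9998 and forecasts of about 11.503, 11.727 and 11.282 TB, matching the `fit` column of the `predict()` output above. Note that the 2018 value is lower than the 2017 value: the fitted parabola peaks and then declines, which is a reason not to trust this model far beyond the data.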
    

    Did you make up these numbers, or are they real data?

    The forecast package would be better for more sophisticated methods.
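
The forecast package for R provides exponential-smoothing and ARIMA models, among others. As a rough illustration of what one such method does, here is a hand-written Python sketch of Holt's linear trend method. It is not the forecast package's implementation: the function name `holt_forecast` and the smoothing parameters `alpha=0.8` and `beta=0.2` are arbitrary assumptions chosen for illustration, and four observations are far too few to estimate them reliably.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Point forecasts from Holt's linear trend method.

    alpha smooths the level, beta smooths the trend (both in [0, 1]).
    The level starts at the first observation and the trend at the
    first difference, which is one common initialization.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # Extrapolate the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]


size_tb = [3.895, 6.863, 8.997, 10.626]  # 2012..2015, from the question

# alpha and beta below are assumed values, not estimated from the data.
forecasts = holt_forecast(size_tb, alpha=0.8, beta=0.2, horizon=3)
for year, value in zip(range(2016, 2019), forecasts):
    print(year, round(value, 3))
```

Unlike the quadratic fit, this method always extrapolates along a straight line from the last smoothed level, so its forecasts keep growing instead of turning over. Which behaviour is more plausible depends on what is actually driving the database size.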