Tags: r, probability, prediction, kernel-density, probability-density

Probability Density Functions in R for predicting next value of incidents


I need to do Probability Density Prediction of the following data in R:

year = c(1971, 1984, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 
2007, 2008, 2009, 2010, 2011, 2012, 2013)
incidents = c(1, 1, 1, 1, 3, 1, 6, 6, 9, 11, 21, 37, 38, 275, 226, 774, 1064)

The data are in an R data.frame like this:

dat <- data.frame(year,incidents)

The goal is to make predictions based on a few years of data and "predict" the value for the last year available.

I'm new to R, so any suggestions, advice and so forth are welcome. Thanks in advance.


Solution

  • Broadly, the two main approaches to modeling are the so-called "mechanistic" and "empirical" approaches. Both have their adherents (and their detractors). The mechanistic approach asserts that modeling should proceed from an understanding of the underlying phenomena (mechanism), which is then translated to some type of mathematical equation(s), which are then fit to the data (to test the mechanism). The empirical approach assembles a (usually long) list of models (equations) and seeks to find the one that "fits best". Empirical modeling is appealing but dangerous because assessing when you have a "good fit" is not trivial - although it is often treated that way.

    You have not given us nearly enough information to formulate a mechanistic model, so here's an illustration of a couple of empirical models, as a cautionary tale:

    Finite-time singularity models are popular with your type of data. Among other things, these models are used to "predict" stock market bubbles (the LPPL model). The basic idea is that a catastrophe (singularity) is coming, and we want to predict when. So we use a function of the form:

    y = a × (c - x)^b

    With b < 0, y approaches a singularity as x -> c.

    In R code, we can fit a model like this as follows:

    # Finite-Time Singularity Model: incidents = a * (c - year)^b
    library(minpack.lm)
    f <- function(par, x) {
      a <- par[1]
      b <- par[2]
      c <- par[3]
      a * (c - x)^b
    }
    # residual function for nls.lm: observed minus fitted values
    resid <- function(par, obs, xx) {obs - f(par, xx)}
    # starting values; c must start beyond the last observed year
    start <- c(a=1, b=-1, c=2100)
    nls.out <- nls.lm(par=start, fn=resid, obs=dat$incidents, xx=dat$year, 
                      control=nls.lm.control(maxiter=500))
    coef(nls.out)
    with(dat, plot(incidents~year, main="Finite-Time Singularity Model"))
    lines(dat$year, f(coef(nls.out), dat$year), col=2, lwd=2)
    

    This gives what appears to be a "pretty good fit".

    In fact, the model overstates incidents early on, and tends to understate them later (which is terrible because we want a prediction for the future). The residuals plot shows this clearly.

    with(dat,plot(year,resid(coef(nls.out),incidents,year),
                  main="Residuals Plot", ylab="residuals"))
    

    Another approach notes that your data are counts (i.e. the number of incidents per year). This suggests a generalized linear model in the Poisson family:

    # generalized linear model, Poisson family
    fit.glm <- glm(incidents ~ year, data=dat, family=poisson)
    with(dat,plot(incidents~year))
    lines(dat$year,predict(fit.glm,type="response"), col=2, lwd=2)
    # standard diagnostic plots, arranged in a 2x2 grid
    par(mfrow=c(2,2))
    plot(fit.glm)
    

    This fit is better, but still not very good, as the diagnostic plots show. The residuals follow a trend, they are not normally distributed, and some of the data points have unacceptably high leverage.
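
    Finally, to get back to the stated goal of fitting on earlier years and "predicting" the last available one, here is a minimal sketch of that workflow using the Poisson GLM above. The holdout split and the model choice are just for illustration, and the diagnostic problems noted above still apply.

    # Hold out the last year (2013), refit the Poisson GLM, and "predict" it
    train   <- dat[dat$year < 2013, ]
    holdout <- dat[dat$year == 2013, ]
    fit.holdout <- glm(incidents ~ year, data=train, family=poisson)
    pred.2013   <- predict(fit.holdout, newdata=holdout, type="response")
    c(observed = holdout$incidents, predicted = unname(round(pred.2013)))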