Search code examples
rregressionlinear-regressiondata-fittingpiecewise

Difficulty fitting piecewise linear data in R


I have the following data (cost of a product vs. time) that looks like the following:

annum <- c(1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 
    1914, 1915, 1916, 1917, 1918, 1919)
cost <- c(0.0000,  18.6140,  92.1278, 101.9393, 112.0808, 122.5521, 
    133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145, 
    339.6527, 362.3537, 377.7775, 402.8443, 437.5539)

mydata <- as.data.frame(cbind(annum, cost))

g <- ggplot(mydata, aes(x = annum, y = cost))
g <- g + geom_point()
g <- g + scale_y_continuous(labels=scales::dollar_format())
g

This is the resulting plot of this data using this code The plot shows something that looks piecewise linear to me; there's a step from 1904 to 1905; then a clear line from 1905 to 1910; then a step; and then another line from 1911 to the end. (The first point (1903, 0) is fictitious.)

I've tried to use the segmented package to model this, but instead of picking something like 1904.5 and 1910.5 as breakpoints, it finds two points between 1911 and 1912.

I've tried some other techniques (e.g., "brute force" from "The R Book," and direct fitting), but I clearly don't understand this as much as I need to. Any help would be very much appreciated.

Ideally, I would end up with an equation for each segment and a single plot showing the piecewise fit and a confidence interval for the fit.


Solution

  • One can use package struccchange for this. Here a simplified code version:

    library("strucchange")
    
    startyear <- startyear
    cost <- c(0.0000,  18.6140,  92.1278, 101.9393, 112.0808, 122.5521, 
              133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145, 
              339.6527, 362.3537, 377.7775, 402.8443, 437.5539)
    
    ts <- ts(cost, start=1903)
    plot(ts)
    
    ## for small data sets you might consider to reduce segment length
    bp <- breakpoints(ts ~ time(ts), data=ts, h = 5)
    
    ## BIC selection of breakpoints
    plot(bp)
    breakdates(bp)
    fm1 <- lm(ts ~ time(ts) * breakfactor(bp), data=ts)
    coef(fm1)
    
    plot(ts, type="p")
    lines(ts(fitted(fm1),  start = startyear),  col = 4)
    lines(bp)
    confint(bp)
    
    lines(confint(bp))
    

    More can be found in the package vignette or one of the related publications, e.g. https://doi.org/10.18637/jss.v007.i02 So it is for example possible to make significance tests, to estimate confidence intervals or to include covariates.

    A segment length of 2 is not possible, because residual variance cannot be estimated. Similarly confidence intervals can only be estiated if segments are long enough. Therefore, only one breakpoint is shown below, while the excellent answer of @Rui Barradas omits confidence intervals but shows two breakpoints.

    one breakpoint

    Her an example without the first two points and an additional assumption to estimate the confidence interval in case of a small segment:

    library("strucchange")
    
    startyear <- 1905
    cost <- c(92.1278, 101.9393, 112.0808, 122.5521, 
              133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145, 
              339.6527, 362.3537, 377.7775, 402.8443, 437.5539)
    
    ts <- ts(cost, start=startyear)
    bp <- breakpoints(ts ~ time(ts), data=ts, h = 5)
    fm1 <- lm(ts ~ time(ts) * breakfactor(bp), data=ts)
    plot(ts, type="p")
    lines(ts(fitted(fm1),  start = startyear),  col = 4)
    lines(confint(bp, het.err=FALSE))
    

    first two values omitted

    Edit:

    • bugs of the original version corrected
    • coefficients and confidence interval added
    • images added
    • example with omitted first 2 values added