Search code examples
rlinear-regressionpiecewise

Broken stick (or piecewise) regression with 2 breakpoints


I want to estimate two breakpoints of a function with the next data:

    df = data.frame (x = 1:180,
                y = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 2, 2, 4, 2, 2, 3, 2, 1, 2,0, 1, 0, 1, 4, 0, 1, 2, 3, 1, 1, 1, 0, 2, 0, 3,  2, 1, 1, 1, 1, 5, 4, 2, 1, 0, 2, 1, 1, 2, 0, 0, 2, 2, 1, 1, 1, 0, 0, 0, 0, 
                    2, 3, 0, 3, 2, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
# plotting y ~ x 
plot(df)

enter image description here

I know that the function have two breakpoints such that:

y = y1 if x < b1;
y = y2 if b1 < x < b2;
y = y3 if b2 < x;

And I want to find b1 and b2 to fit a kind of rectangular function with the following form

enter image description here

Can anyone help me or point me in the right direction? Thanks!


Solution

  • 1) kmeans Try kmeans like this:

    set.seed(123)
    km <- kmeans(df, 3, nstart = 25)
    
    > fitted(km, "classes") # or equivalently km$cluster
      [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
     [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
     [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
    [112] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
    [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
    
    > unique(fitted(km, "centers")) # or except for order km$centers
          x         y
    3  30.5 0.5166667
    1  90.5 0.9000000
    2 150.5 0.0000000
    
    > # groups are x = 1-60, 61-120 and 121-180
    > simplify2array(tapply(df$x, km$cluster, range))
           1   2  3
    [1,]  61 121  1
    [2,] 120 180 60
    
    plot(df, col = km$cluster)
    lines(fitted(km)[, "y"] ~ x, df)
    

    screenshot

    2) brute force Another approach is a brute force approach in which we calculate every possible pair of breakpoints and choose the pair whose sum of squares in a linear model is least.

    grid <- subset(expand.grid(b1 = 1:180, b2 = 1:80), b1 < b2)
    
    # the groups are [1, b1], (b1, b2], (b2, Inf)
    fit <- function(b1, b2, x, y) {
       grp <- factor((x > b1) + (x > b2))
       lm(y ~ grp)
    }
    
    dv <- function(...) deviance(fit(...))
    
    wx <- which.min(mapply(dv, grid$b1, grid$b2, MoreArgs = df))
    
    grid[wx, ]
    ##       b1 b2
    ## 14264 44 80
    
    plot(df)
    lines(fitted(fit(grid$b1[wx], grid$b2[wx], x, y)) ~ x, df)
    

    screenshot