r, regression, piecewise

How to Create Piecewise Constant (Bin Smooth) Model in R?


I've looked all over for an answer to this question.

If you have an explanatory variable x and a response y, how can you fit a piecewise constant regression model in R?

I know the segmented package may be used to create a piecewise non-constant model, but I cannot figure out how to constrain the slope of each line segment to be 0. I need to be able to use the model for prediction, which is why I cannot simply use the regressogram function.

Thanks for any help,

Jack


Solution

  • You can do this in base R using approxfun with the argument method = "constant". Since you don't provide data, I made an example using data built into R.

    StepFun = approxfun(x=iris$Sepal.Length, 
        y = iris$Sepal.Width, method = "constant")
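
    For prediction, the function returned by approxfun can be called directly on new x values. The following is a minimal sketch, where new_x and pred are hypothetical names. With the default f = 0, each new value gets the y of the nearest observed x at or below it (tied x values are averaged), and values outside the observed range return NA under the default rule = 1.

    ```r
    # Step function through the observed points.
    # ties = mean is the default; it is written out to make the averaging explicit.
    StepFun <- approxfun(x = iris$Sepal.Length, y = iris$Sepal.Width,
                         method = "constant", ties = mean)

    # Hypothetical new x values; 9 lies outside the observed range
    new_x <- c(4.95, 5.05, 9)
    pred <- StepFun(new_x)
    print(pred)
    ```

    Passing rule = 2 to approxfun would instead extend the first and last values outward, if NA is not wanted outside the data range.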
    

    Edit

    I now think that the question is to break the range of x into bins and create a piecewise constant function (using the mean value of y per bin). I am giving two versions of this: one that is easier, and one that matches the OP's comments better. Both use cut to bin the data.

    Version 1: Specify the endpoints of the bins

    This is easy if you just want to specify the bins themselves. Notice that I am plotting with a large number of points (n=1001). curve draws straight lines between the points it evaluates, so with only a few points the jumps between bins would appear as slanted segments.

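    ## To specify break boundaries
    BREAKS = seq(4,8,0.5)
    BINS  = cut(iris$Sepal.Length, breaks=BREAKS, labels=FALSE)
    ## Mean of the response (Sepal.Width) within each bin of x
    MEANS = aggregate(iris$Sepal.Width, list(BINS), mean)$x
    
    ## BREAKS[-1] holds the right endpoint of each bin; f = 1 makes the
    ## step function return a bin's mean on (left, right], matching cut()
    Step2 = approxfun(x=BREAKS[-1], y = MEANS, method = "constant", f = 1)
    curve(Step2, xlim=c(4.5,8),n=1001)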
    

    [Figure: plot of Step2]
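
    Since the question asks for a model that can be used for prediction, another option is to fit the bin means with lm on a binned factor. This is an alternative sketch, not part of the approach above: the names breaks, dat, fit, new_dat and pred are hypothetical. With a single factor predictor, each fitted value is the mean of y in that bin, and predict() re-applies cut() to the new x values.

    ```r
    # Bin edges, same as in Version 1
    breaks <- seq(4, 8, 0.5)
    dat <- data.frame(x = iris$Sepal.Length, y = iris$Sepal.Width)

    # One factor level per bin; the fitted value for a level is the bin mean of y
    fit <- lm(y ~ cut(x, breaks = breaks), data = dat)

    # Predict for new x values; 4.7 and 5.0 both fall in the bin (4.5, 5]
    new_dat <- data.frame(x = c(4.7, 5.0, 6.2))
    pred <- predict(fit, newdata = new_dat)
    print(pred)
    ```

    This relies on every bin containing at least one observation, which holds for these breaks on iris. Note that cut() returns NA for x values outside the range of breaks, so such values fall in no bin.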

    Version 2: Specify the number of points per bin

    The goal of this version is not to make the bins the same width, but to have (approximately) the same number of points in each bin. This cannot always be achieved exactly: if several x values in your data are identical, the counts per bin may differ, but this will get you as close as possible. The idea is to use quantiles as bin boundaries, since they split the data into groups of roughly equal size.

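    ## To specify number of points per bin
    PointsPerBin = 15
    Q = seq(0,1, PointsPerBin/length(iris$Sepal.Length))
    QBREAKS = quantile(iris$Sepal.Length, Q)
    ## include.lowest=TRUE keeps the minimum x value in the first bin
    QBINS  = cut(iris$Sepal.Length, breaks=QBREAKS, labels=FALSE, include.lowest=TRUE)
    ## Mean of the response (Sepal.Width) within each bin of x
    QMEANS = aggregate(iris$Sepal.Width, list(QBINS), mean)$x
    
    ## f = 1 returns each bin's mean on (left, right], matching cut()
    Step3 = approxfun(x=QBREAKS[-1], y = QMEANS, method = "constant", f = 1)
    curve(Step3, xlim=c(4.5,8),n=1001)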
    

    Again, if you use a small number of points, it will look like there are slanted regions in the plot.

    [Figure: plot of Step3]
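
    To see how close the quantile bins come to the target size, the points in each bin can be counted with table. This is a small check under the same assumptions as Version 2: the names x, points_per_bin, probs, qbreaks, qbins and counts are hypothetical.

    ```r
    x <- iris$Sepal.Length
    points_per_bin <- 15   # target bin size, as in Version 2

    # Quantile-based bin boundaries
    probs <- seq(0, 1, points_per_bin / length(x))
    qbreaks <- quantile(x, probs)

    # include.lowest = TRUE keeps the minimum x in the first bin
    qbins <- cut(x, breaks = qbreaks, labels = FALSE, include.lowest = TRUE)

    # Number of points that landed in each bin
    counts <- table(qbins)
    print(counts)
    ```

    Because Sepal.Length has many repeated values, the counts only approximate 15. If repeated values make two quantiles coincide, cut() stops with an error about non-unique breaks; wrapping the breaks in unique() is one possible workaround, at the cost of fewer, larger bins.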