Search code examples
rggplot2loess

Definition of intervals with loess method


Usually I have difficulties to understand how loess interpolation works, in particular the definition of intervals. Often I am not able to understand why they do not appear. As far as I know it is needed at least 5 data points but in some case even with more than 5 I am not able to see them.

I report below the last example I have:

# Dataframe 1 creation
temperatura <- c(120, 130, 140, 150, 160, 120, 130, 140, 150, 160)
viscosita <- c(65000, 45000, 29500, 20500, 15500, 65000, 45000, 29500, 20500, 15500)
df <- data.frame(Temperatura = temperatura, Viscosita = viscosita)

library(ggplot2)

Graph1 <- ggplot(data=df, aes(x=`Temperatura`, y=`Viscosita`)) +
geom_point(aes(color = "red"), shape = 1, size = 3.0, stroke = 1.5) + 
geom_smooth(aes(level=0.99, span = 0.1)) 

Graph1

enter image description here

But if I change slightly one point, I get them:

# Dataframe 2 creation
temperatura <- c(120, 130, 140, 150, 160, 120, 130, 140, 150, 160)
viscosita <- c(65000, 45000, 29500, 20500, 15500, 65000, 43500, 29500, 20500, 15500)
df <- data.frame(Temperatura = temperatura, Viscosita = viscosita)

library(ggplot2)

Graph2 <- ggplot(data=df, aes(x=`Temperatura`, y=`Viscosita`)) +
geom_point(aes(color = "red"), shape = 1, size = 3.0, stroke = 1.5) + 
geom_smooth(aes(level=0.99, span = 0.1)) 

Graph2

enter image description here

I would like better to understand how it works. I have tried to change parameters n (number of window-points where it make the loess calculation), level (% of confidence interval), span, but I was not able to get the bands on Graph1.


Solution

  • If you wish to supply arguments to the loess method in geom_smooth, you need to put them in a list and pass them to the method.args argument.

    However, the main issue you have with the confidence intervals is just that you are essentially getting a perfect fit with the default degree-2 polynomial:

    loess(viscosita ~ temperatura, data = df)
    #> Call:
    #> loess(formula = viscosita ~ temperatura, data = df)
    #>
    #> Number of Observations: 10 
    #> Equivalent Number of Parameters: 4.53 
    #> Residual Standard Error: 1.41e-11 
    

    The standard errors are so small that you cannot see them in a plot:

    predict(loess(viscosita ~ temperatura, data = df), se = TRUE)
    #> $fit
    #>     1     2     3     4     5     6     7     8     9    10 
    #> 65000 45000 29500 20500 15500 65000 45000 29500 20500 15500 
    #> 
    #> $se.fit
    #>            1            2            3            4            5            6 
    #> 9.967537e-12 9.967537e-12 9.967537e-12 9.967537e-12 9.967537e-12 9.967537e-12 
    #>            7            8            9           10 
    #> 9.967537e-12 9.967537e-12 9.967537e-12 9.967537e-12 
    #> 
    #> $residual.scale
    #> [1] 1.409623e-11
    #> 
    #> $df
    #> [1] 4.80231
    

    So the plot is showing the standard errors, they are just several orders of magnitude smaller than a pixel wide. If you change to, say, a degree-1 fit, then you would see confidence intervals, but the fit isn't as good:

    ggplot(data = df, aes(x = Temperatura, y = Viscosita)) +
      geom_point(aes(color = "red"), shape = 1, size = 3.0, stroke = 1.5) + 
      geom_smooth(method = 'loess', method.args = list(degree = 1))
    

    enter image description here

    I think though that ultimately loess may be the wrong choice for drawing a smooth curve here. If you have a theoretical model to which you are trying to fit your data, you could use non-linear least squares. It looks as though you are modelling how viscosity in a liquid changes with temperature, in which case

    Where A and B are constants is a commonly used model1.

    To do this, we need to get nls to produce standard errors, which requires the add-on package investr and a couple of helper functions:

    nls_se <- function(formula, data, start, type = "confidence", ...) {
      mod <- nls(formula, data, start)
      class(mod) <- "nls_se"
      attr(mod, "type") <- type
      mod
    }
    
    predict.nls_se <- function(model, newdata, level = 0.9, interval, ...) {
      class(model) <- "nls"
      p <- investr::predFit(model, newdata = newdata, 
                            interval = attr(model, "type"), level = level)
      list(fit = p, se.fit = p[,3] - p[,1])
    }
    

    This now allows:

    mod <- nls(Viscosita ~ A * exp(B/Temperatura), 
               data = df, start = list(A = 1000, B = 200))
    
    ggplot(data = df, aes(x = Temperatura, y = Viscosita)) +
      geom_point(aes(color = "red"), shape = 1, size = 3.0, stroke = 1.5) + 
      geom_smooth(method = nls_se, formula = y ~ A * exp(B/x),
                  method.args = list(start = list(A = 1000, B = 200))) +
      annotate('text', x = 140, y = 60000, 
               label = bquote(Viscosita == .(round(coef(mod)[1], 2)) * 
                        e^{.(round(coef(mod)[2], 2)) / Temperatura}),
               check_overlap = TRUE)
    

    enter image description here

    Which is nice because it gives you both the standard error on your plot and a formula for the best fitting line based on a theoretical model.