Search code examples
rfunctioncategorical-datanlsnonlinear-functions

Fitting nls model with a categorical variable


I want to fit a linear-plateau (nls) model describing height as a function of age, and I want to test if there are signiffiant differences for any parameter of the model between regions.

Here is what I have so far:

# Create data
df1 <- cbind.data.frame (height = c (0.5, 0.6, 0.9, 1.3, 1.5, 1.6, 1.6,
                                     0.6, 0.6, 0.8, 1.3, 1.5, 1.6, 1.5,
                                     0.6, 0.8, 1.0, 1.4, 1.6, 1.6, 1.6,
                                     0.5, 0.8, 1.0, 1.3, 1.6, 1.7, 1.6),
                         age = c (0.5, 0.9, 3.0, 7.3, 12.2, 15.5, 20.0,
                                  0.4, 0.8, 2.3, 8.5, 11.5, 14.8, 21.3,
                                  0.5, 1.0, 5.1, 11.1, 12.3, 16.0, 19.8,
                                  0.5, 1.1, 5.5, 10.2, 12.2, 15.4, 20.5),
                         region = as.factor (c (rep ("A", 7),
                                                rep ("B", 7),
                                                rep ("C", 7),
                                                rep ("D", 7))))

> head (df1)
  height  age region
1    0.5  0.5      A
2    0.6  0.9      A
3    0.9  3.0      A
4    1.3  7.3      A
5    1.5 12.2      A
6    1.6 15.5      A

# Create linear-plateau function
lp <- function(x, a, b, c){
  ifelse (x < c, a + b * x, a + b * c)
  } # Where 'a' is the intercept, 'b' the slope and 'c' the breakpoint

# Fit the model ignoring region
m1 <- nls (height ~ lp (x = age, a, b, c),
           data = df1,
           start = list (a = 0.5, b = 0.1, c = 13))

> summary (m1)

Formula: height ~ lp(x = age, a, b, c)

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
a  0.582632   0.025355   22.98   <2e-16 ***
b  0.079957   0.003569   22.40   <2e-16 ***
c 12.723995   0.511067   24.90   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.07468 on 25 degrees of freedom

Number of iterations to convergence: 2 
Achieved convergence tolerance: 5.255e-09

I want to fit the same model but considering region, and test if a, b, and c estimates are different between regions.

I belive this post might be usefull, but I dont know how to apply it to this data/function.

Here is how data looks like:

enter image description here

Solutions without using nls are welcome as well


Solution

  • Fit the model with the same parameters for each region giving fm1 and again with different parameters giving fm2 and use anova to test the difference.

    We use the plinear algorithm for fm1 as it eliminates the need for starting values for the linear parameters. In that case the RHS should be a matrix whose first column multiplies the intercept and whose second column multiplies the slope. The two linear parameters will be named .lin1 and .lin2. We use the coefficients from fm1 repeated 4 times as starting values for the fm2 fit.

    fm1 <- nls(height ~ cbind(1, pmin(age, c)), df1, start = list(c = mean(df1$age)),
      algorithm = "plinear")
    co <- as.list(coef(fm1))
    
    fm2 <- nls(height ~ a[region] + b[region] * pmin(age, c[region]), df1, 
      start = list(a = rep(co$.lin1, 4), b = rep(co$.lin2, 4), c = rep(co$c, 4)))
    
    anova(fm1, fm2)
    

    giving:

    Analysis of Variance Table
    
    Model 1: height ~ cbind(1, pmin(age, c))
    Model 2: height ~ a[region] + b[region] * pmin(age, c[region])
      Res.Df Res.Sum Sq Df   Sum Sq F value Pr(>F)
    1     25    0.13944                           
    2     16    0.11895  9 0.020483  0.3061 0.9617
    

    thus we cannot reject the hypothesis that the parameters are the same across regions.

    If we wished to test differing values of c but common intercepts and slopes we could use

    fm3 <- nls(height ~ cbind(1, pmin(age, c[region])), df1, 
       start = list(c = rep(co$c, 4)), algorithm = "plinear")
    
    anova(fm1, fm3)
    

    Although we cannot reject the hypothesis that the values of c are the same across regions visually below we see that the cutoff ages for plateau values do look somewhat different so we might want to use fm3 even though it is not significantly different than fm1. We might want to be guided by other factors associated with the application here rather than just the fit.

    Graphics

    Below we show the individual fits from fm2 and the overall fit from fm1.

    library(ggplot2)
    
    df1$Everything <- "Everything"
    ggplot(df1, aes(age, fitted(fm2), col = region)) +
      geom_line() +
      geom_point() +
      geom_line(aes(age, fitted(fm1), col = Everything), lty = 2, lwd = 2)
    

    screenshot