Search code examples
rmodelregressionlmpolynomials

R polynomal regression or group values and test between groups + outcome interpreatation


I am trying to model the relation between a scar acquisition rate of a wild population of animals, and I have calculated yearly rates before.

If you see below the plot, it seems to me that rates rise through the middle of the period and than fall again. I have tried to fit a polynomial LM with the code

model1 <- lm(Rate~poly(year, 2, raw = TRUE),data=yearlyratesub) 
summary(model1)                         
model1 

I have plotted using:

g <-ggplot(yearlyratesub, aes(year, Rate)) + geom_point(shape=1) +  geom_smooth(method = lm, formula = y ~ poly(x, 2, raw = TRUE))
g

The model output was:

Call:
lm(formula = Rate ~ poly(year, 2, raw = TRUE), data = yearlyratesub)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.126332 -0.037683 -0.002602  0.053222  0.083503 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)  
(Intercept)                -8.796e+03  3.566e+03  -2.467   0.0297 *
poly(year, 2, raw = TRUE)1  8.747e+00  3.545e+00   2.467   0.0297 *
poly(year, 2, raw = TRUE)2 -2.174e-03  8.813e-04  -2.467   0.0297 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.0666 on 12 degrees of freedom
Multiple R-squared:  0.3369,    Adjusted R-squared:  0.2264 
F-statistic: 3.048 on 2 and 12 DF,  p-value: 0.08503

How can I enterpret that now? The overall model p value is not significant but the intercept and single slopes are?

Should I rather try another fit than x² or even group the values and test between groups e.g. with an ANOVA? I know the LM has low fit but I guess it's because I have little values and maybe x² might be not it...?

Would be happy about input regarding model and outcome interpretation..

enter image description here


Solution

  • Grouping

    Since the data was not provided (next time please provide a complete reproducible question including all inputs) we used the data in the Note at the end. We see that that the model is highly significant if we group the points using the indicated breakpoints.

    g <- factor(findInterval(yearlyratesub$year, c(2007.5, 2014.5))+1); g
    ##  [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
    ## Levels: 1 2 3
    
    fm <- lm(rate ~ g, yearlyratesub)
    summary(fm)
    

    giving

    Call:
    lm(formula = rate ~ g, data = yearlyratesub)
    
    Residuals:
          Min        1Q    Median        3Q       Max 
    -0.064618 -0.018491  0.006091  0.029684  0.046831 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  0.110854   0.019694   5.629 0.000111 ***
    g2           0.127783   0.024687   5.176 0.000231 ***
    g3          -0.006714   0.027851  -0.241 0.813574    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.03939 on 12 degrees of freedom
    Multiple R-squared:  0.7755,    Adjusted R-squared:  0.738 
    F-statistic: 20.72 on 2 and 12 DF,  p-value: 0.0001281
    

    We could consider combining the outer two groups.

    g2 <- factor(g == 2)
    fm2 <- lm(rate ~ g2, yearlyratesub)
    summary(fm2)
    

    giving:

    Call:
    lm(formula = rate ~ g2, data = yearlyratesub)
    
    Residuals:
          Min        1Q    Median        3Q       Max 
    -0.064618 -0.016813  0.007096  0.031363  0.046831 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  0.10750    0.01341   8.015 2.19e-06 ***
    g2TRUE       0.13114    0.01963   6.680 1.52e-05 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.03793 on 13 degrees of freedom
    Multiple R-squared:  0.7744,    Adjusted R-squared:  0.757 
    F-statistic: 44.62 on 1 and 13 DF,  p-value: 1.517e-05
    

    Sinusoid

    Looking at the graph it seems that the points are turning up at the left and right edges suggesting we use a sinusoidal fit. a + b * cos(c * year)

    fm3 <- nls(rate ~ cbind(a = 1, b = cos(c * year)), 
      yearlyratesub, start = list(c = 0.5), algorithm = "plinear")
    summary(fm3)
    

    giving

    Formula: rate ~ cbind(a = 1, b = cos(c * year))
    
    Parameters:
            Estimate Std. Error  t value Pr(>|t|)    
    c      0.4999618  0.0001449 3449.654  < 2e-16 ***
    .lin.a 0.1787200  0.0150659   11.863  5.5e-08 ***
    .lin.b 0.0753754  0.0205818    3.662  0.00325 ** 
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.05688 on 12 degrees of freedom
    
    Number of iterations to convergence: 2 
    Achieved convergence tolerance: 5.241e-08
    

    Comparison

    Plotting the fits and looking at their residual sum of squares and AIC we have

    plot(yearlyratesub)
    # fm0 from Note at end, fm and fm2 are grouping models, fm3 is sinusoidal
    L <- list(fm0 = fm0, fm = fm, fm2 = fm2, fm3 = fm3)
    for(i in seq_along(L)) {
      lines(fitted(L[[i]]) ~ year, yearlyratesub, col = i, lwd = 2)
    }
    legend("topright", names(L), col = seq_along(L), lwd = 2)
    

    giving the following where lower residual sum of squares and AIC (which takes into account the number of paramters) are better. We see that fm fits the most closely based on residual sum of squares but with fm2 fitting almost as well; however, when taking the number of parameters into account by using AIC fm2 has the lowest and so is most favored by that criterion.

    cbind(RSS = sapply(L, deviance), AIC = sapply(L, AIC))
    ##            RSS       AIC
    ## fm0 0.05488031 -33.59161
    ## fm  0.01861659 -49.80813
    ## fm2 0.01870674 -51.73567
    ## fm3 0.04024237 -38.24512
    

    screenshot

    Note

    yearlyratesub <- 
    structure(list(year = c(2004, 2005, 2006, 2007, 2008, 2009, 2010, 
    2011, 2012, 2013, 2014, 2015, 2017, 2018, 2019), rate = c(0.14099813521287, 
    0.0949946651016247, 0.0904788394070601, 0.11694517831575, 0.26786193592875, 
    0.256346628540479, 0.222029818828298, 0.180116679856725, 0.285467976459104, 
    0.174019208113095, 0.28461698734932, 0.0574827955982996, 0.103378448084776, 
    0.114593695172686, 0.141105952837639)), row.names = c(NA, -15L
    ), class = "data.frame")
    
    fm0 <- lm(rate ~ poly(year, 2, raw = TRUE), yearlyratesub)
    summary(fm0)
    

    giving

    Call:
    lm(formula = rate ~ poly(year, 2, raw = TRUE), data = yearlyratesub)
    
    Residuals:
          Min        1Q    Median        3Q       Max 
    -0.128335 -0.038289 -0.002715  0.054090  0.084792 
    
    Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)  
    (Intercept)                -8.930e+03  3.621e+03  -2.466   0.0297 *
    poly(year, 2, raw = TRUE)1  8.880e+00  3.600e+00   2.467   0.0297 *
    poly(year, 2, raw = TRUE)2 -2.207e-03  8.949e-04  -2.467   0.0297 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.06763 on 12 degrees of freedom
    Multiple R-squared:  0.3381,    Adjusted R-squared:  0.2278 
    F-statistic: 3.065 on 2 and 12 DF,  p-value: 0.0841