Search code examples
rposthocemmeans

post-hoc comparisons on a linear model


I'm running a linear model, and would like to compare a set of points on the slope to the estimates at 0. My code follows the layout of the response here. The output seems to have a single, identical p-value. I would have expected values near 0 to have high p-values and values far from 0 to have small p-values. I definitely didn't expect to have identical p-values across the comparisons. Any suggestions?

Toy dataset:

library(ggplot2)
library(tidyr)
library(emmeans)

df <- structure(list(Distance = c(0, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5), 
                    Mean = c(139, 119.8, 121, 130.4, 115.9, 134.7, 134.7, 122.2, 118.8, 116.9, 114.4, 
                            109.6, 103.9, 113.2, 103.5, 113.3, 122.1, 105.9, 115.2)), row.names = c(NA, -19L), 
                class = c("tbl_df", "tbl", "data.frame"))

m <- lm(Mean ~ Distance, data = df)
df$Pred <- predict(m)

# data and predictions look ok
ggplot(df) +
    geom_point(aes(x = Distance, y = Mean)) +
    geom_line(aes(x = Distance, y = Pred)) 

# create a fake grid for emmeans
fake.df <- data.frame(Distance = 0:10)

# run a treatment vs control, where control is value at 0 and "treatment" are values
# stepping away from 0
emm <- emmeans(m, trt.vs.ctrl1 ~ Distance, data = fake.df,  
            cov.reduce = FALSE, covnest = TRUE)
emm             

Solution

  • In this model, Distance is a numeric predictor with only a linear effect. Thus, any test comparing model estimates at two Distances is just a test of the slope of the Distance trend, and all such tests thus have the same P value.

    Addendum

    The question is a clue to how easy it is to confuse estimation and prediction.

    Estimation is about parameters; in this example, the slope of the line is a single parameter, estimated with all the data, and any comparison of estimates at two distances is equivalent to testing the significance of the slope.

    Prediction is about what will happen with future data. To predict those data, we have to account not only for the variation in estimating the slope (in this case), but also the variation inherent in the future data (as estimated by the the RMSE). If we truly believe that the error distribution is normal, we can obtain prediction intervals as follows:

    > emm <- emmeans(m, "Distance", at = list(Distance = c(0,2,4,6,8,10)))
    
    > predict(emm, interval = "pred", sigma = sigma(m))
     Distance prediction   SE df lower.PL upper.PL
            0        131 8.61 17    112.5      149
            2        126 8.22 17    108.5      143
            4        121 8.02 17    104.1      138
            6        116 8.02 17     99.3      133
            8        111 8.23 17     94.0      129
           10        107 8.62 17     88.3      125
    
    Prediction intervals and SEs are based on an error SD of 7.7904 
    Confidence level used: 0.95 
    

    Now, suppose we want to compare two independent future observations Y0 (taken at Distance = 0 and Y2 (taken at Distance = 2). The prediction of Y0 - Y2 is estimated as 131 - 126 = 5, and the SE of prediction is sqrt(8.61^2 + 8.22^2) = 11.90. So Y0 - Y2 will be about 5 +/- 2*11.9, or (-18.8, 28.8) - an interval that contains zero.

    If we want to compare future values of Y0 and Y10 (taken at Distance = 10), however, we predict (131 - 107) +/- 2*sqrt(8.61^2+8.62^2) --> (-0.4, 48.4). This interval still includes zero, but just barely; so it is much more likely that Y10 will be less than Y0 than it is for Y2 to be less than Y0.

    I hope this helps clarify the situation.