Is there an easier way to manually calculate estimated group means using the model.matrix?


I want to calculate estimated group mean scores in a 2x2 Gaussian regression after obtaining the regression coefficients. Here is toy data: 200 observations, 100 in each region (a and b) and 100 of each sex (m and f). I have designed the scores so that there is a 5-point difference on average between regions a and b but no difference between m and f.

set.seed(1234)

d <- data.frame(region = factor(rep(letters[1:2],each=100)),
                sex = factor(rep(c("m", "f"),times=100)),
                score = round(x = c(rnorm(100, mean = 5, sd = 1),
                                    rnorm(100, mean = 10, sd = 1)),
                              digits = 1))

Now I will use the model.matrix() function to obtain contrast coefficients for each observation, based on its group membership. I will use treatment coding, that is [0,1], with region a and sex f as the reference levels (R sorts factor levels alphabetically by default, so f comes before m).

model.matrix(object = score ~ region*sex,
             data = d,
             contrasts.arg = list(region = contr.treatment(nlevels(d$region)),
                                  sex = contr.treatment(nlevels(d$sex)))) -> cmTreat
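
Since the reference level determines what the intercept means, it may be worth checking the level order before building the model matrix. The following is a minimal sketch that rebuilds only the sex factor from the toy data; `sex_m_ref` is an illustrative name and the releveling step is an optional alternative, not something used elsewhere in this question.

```r
# Rebuild only the sex factor from the toy data above
sex <- factor(rep(c("m", "f"), times = 100))

# factor() sorts levels alphabetically by default, so "f" is the reference level
levels(sex)   # "f" "m"

# Treatment contrasts for a two-level factor: one 0/1 indicator column
contr.treatment(nlevels(sex))

# Optional: make "m" the reference level instead (illustrative only)
sex_m_ref <- relevel(sex, ref = "m")
levels(sex_m_ref)   # "m" "f"
```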

Now we can use the model matrix directly in the regression using the lm() function. We specify 0 + terms because the model matrix already contains an intercept.

(lm(d$score ~ 0 + cmTreat) -> lmTreat)

# output
# Call:
#   lm(formula = d$score ~ 0 + cmTreat)
# 
# Coefficients:
# cmTreat(Intercept)       cmTreatregion2          cmTreatsex2  cmTreatregion2:sex2  
#              4.814                5.132                0.056                0.140 
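
As a sanity check on the `0 +` step, the sketch below compares the matrix-based fit with the ordinary formula interface. It assumes default treatment contrasts, and `X`, `fit_matrix`, `fit_formula` and `fit_dup` are illustrative names that do not appear in the code above.

```r
# Recreate the toy data
set.seed(1234)
d <- data.frame(region = factor(rep(letters[1:2], each = 100)),
                sex    = factor(rep(c("m", "f"), times = 100)),
                score  = round(c(rnorm(100, mean = 5, sd = 1),
                                 rnorm(100, mean = 10, sd = 1)), digits = 1))

# Design matrix with default treatment contrasts; it already has an intercept column
X <- model.matrix(score ~ region * sex, data = d)

fit_matrix  <- lm(d$score ~ 0 + X)                 # 0 + stops lm() adding a second intercept
fit_formula <- lm(score ~ region * sex, data = d)  # the usual formula interface

# The coefficient values agree; only their names differ
all.equal(unname(coef(fit_matrix)), unname(coef(fit_formula)))

# Without 0 +, the intercept is duplicated and one coefficient comes back as NA
fit_dup <- lm(d$score ~ X)
coef(fit_dup)
```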

The regression has retrieved the main effects and the interaction. But what if we want to get estimated marginal means, specifically the estimated mean in each 'cell' of the 2 x 2 (region a - female, region a - male, region b - female, region b - male)?

We can do this manually via the attributes of the model matrix.

treatCoefs <- coef(lmTreat) # assign the vector of coefficients a name

# mean in region a female: intercept[1] + region[0] + sex[0] + region[0]*sex[0]
regionA_f <- treatCoefs[1] + treatCoefs[2]*attr(cmTreat, which = "contrasts")$region[,1][1] + treatCoefs[3]*attr(cmTreat, which = "contrasts")$sex[,1][1] + treatCoefs[4]*attr(cmTreat, which = "contrasts")$region[,1][1]*attr(cmTreat, which = "contrasts")$sex[,1][1]

# mean in region a male: intercept[1] + region[0] + sex[1] + region[0]*sex[1]
regionA_m <- treatCoefs[1] + treatCoefs[2]*attr(cmTreat, which = "contrasts")$region[,1][1] + treatCoefs[3]*attr(cmTreat, which = "contrasts")$sex[,1][2] + treatCoefs[4]*attr(cmTreat, which = "contrasts")$region[,1][1]*attr(cmTreat, which = "contrasts")$sex[,1][2]

# mean in region b female: intercept[1] + region[1] + sex[0] + region[1]*sex[0]
regionB_f <- treatCoefs[1] + treatCoefs[2]*attr(cmTreat, which = "contrasts")$region[,1][2] + treatCoefs[3]*attr(cmTreat, which = "contrasts")$sex[,1][1] + treatCoefs[4]*attr(cmTreat, which = "contrasts")$region[,1][2]*attr(cmTreat, which = "contrasts")$sex[,1][1]

# mean in region b male: intercept[1] + region[1] + sex[1] + region[1]*sex[1]
regionB_m <- treatCoefs[1] + treatCoefs[2]*attr(cmTreat, which = "contrasts")$region[,1][2] + treatCoefs[3]*attr(cmTreat, which = "contrasts")$sex[,1][2] + treatCoefs[4]*attr(cmTreat, which = "contrasts")$region[,1][2]*attr(cmTreat, which = "contrasts")$sex[,1][2]

Now if we compare the actual group means to the estimated means (apologies to non-dplyr people)...

library(dplyr)
library(tibble) # add_column() is provided by the tibble package
d %>%
  group_by(region, sex) %>%
    summarise(actualMean = mean(score)) %>%
      add_column(estMeans = c(regionA_f, regionA_m, regionB_f, regionB_m))

# # A tibble: 4 × 4
# # Groups:   region [2]
# region  sex    actualMean estMeans
# <fct>   <fct>        <dbl>    <dbl>
# 1 a      f           4.81     4.81
# 2 a      m           4.87     4.87
# 3 b      f           9.95     9.95
# 4 b      m           10.1     10.1

So this works great. "What is the problem?" I hear you ask. Well, you saw how much code was required to get the estimated means for each group. I can do it that way, but I was wondering: is there an easier way to do this manually?

I know I can use Russ Lenth's excellent emmeans package, and I do use it a lot, but I wanted to learn how to do this manually in a more elegant way. I know nothing of matrix algebra and not a lot about contrast matrices. I just can't help feeling that there is a better way (one whose method might adapt better across different designs and numbers of levels).

P.S. This question may have been better suited to Cross Validated, but I thought I would try here first as it is just R-specific enough to warrant posting on SO.


Solution

  • As I mentioned in the comments above, the predict() function is what you seem to be after, but it wasn't completely clear, when you wrote of doing things manually, whether you wanted to avoid using predict() or not.

    You can use expand.grid() to make a data frame of factor combinations to use with predict() (ensure that the factor levels are in the same order as used in the model):

    (grid <- expand.grid(region = factor(c("a", "b")), sex = factor(c("m", "f"))))
    
      region sex
    1      a   m
    2      b   m
    3      a   f
    4      b   f
    
    lm(score ~ region * sex, d) |> 
      predict(newdata = grid) |> 
      cbind(grid, pred = _)
    
      region sex   pred
    1      a   m  4.870
    2      b   m 10.142
    3      a   f  4.814
    4      b   f  9.946
    

    However, in your example you used the design matrix directly in the model, so we need to pass predict() a data frame containing a matrix column of the same width, named cmTreat to match the model term. We can create this matrix with model.matrix().

    (design <- model.matrix(~ region * sex, grid))
    
      (Intercept) regionb sexm regionb:sexm
    1           1       0    1            0
    2           1       1    1            1
    3           1       0    0            0
    4           1       1    0            0
    attr(,"assign")
    [1] 0 1 2 3
    attr(,"contrasts")
    attr(,"contrasts")$region
    [1] "contr.treatment"
    
    attr(,"contrasts")$sex
    [1] "contr.treatment"
    
    cbind(grid,
          pred = predict(lmTreat, data.frame(cmTreat = I(design))))
    
      region sex   pred
    1      a   m  4.870
    2      b   m 10.142
    3      a   f  4.814
    4      b   f  9.946
    

    We can also calculate this without using predict() via:

    cbind(grid, pred = rowSums(design * coef(lmTreat)[col(design)]))
    
      region sex   pred
    1      a   m  4.870
    2      b   m 10.142
    3      a   f  4.814
    4      b   f  9.946
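
The same calculation can also be written as a matrix product, which may adapt more easily to designs with more factors or levels. This is a minimal sketch, assuming default treatment contrasts and a model refit through the formula interface; `fit`, `cells`, `X_cells` and `est` are illustrative names that do not appear above.

```r
# Recreate the toy data from the question
set.seed(1234)
d <- data.frame(region = factor(rep(letters[1:2], each = 100)),
                sex    = factor(rep(c("m", "f"), times = 100)),
                score  = round(c(rnorm(100, mean = 5, sd = 1),
                                 rnorm(100, mean = 10, sd = 1)), digits = 1))

fit <- lm(score ~ region * sex, data = d)

# One row per observed region x sex combination; factor levels are inherited from d
cells <- unique(d[c("region", "sex")])

# Design matrix for those cells, with columns in the same order as coef(fit)
X_cells <- model.matrix(~ region * sex, data = cells)

# Estimated cell means: design matrix times coefficient vector
est <- drop(X_cells %*% coef(fit))
cbind(cells, est)
```

Because `cells` is built from the data rather than typed by hand, the factor levels and their order match the model automatically.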