lmerTest::lmer: Creating an interaction term manually

Short version

factorC <- paste(factorA, factorB)
Is factorA*factorB equivalent to factorA+factorB+factorC in the formula of a linear mixed effects model?

Background

I use the function ancombc2() from the package ANCOMBC. ANCOMBC is a package used for differential abundance testing with species frequency data. For details see here: https://www.nature.com/articles/s41592-023-02092-7 However, I hope we do not have to deal with the specificities of this package for this question, since ANCOMBC is using lmer() from lmerTest / lme4.

My question is about the manual inclusion of an interaction term. ANCOMBC does not support interaction terms. However, the authors write the following:

Q: Can the ancombc or ancombc2 function handle interaction terms in the analysis?

A: Unfortunately, the inclusion of interaction terms in the fix_formula argument of ancombc or ancombc2 can lead to complexities and potential confusion in the multi-group comparisons. To address this, it is recommended to manually create the interaction term of interest outside of the formula and perform the analysis accordingly.

By manually creating the interaction term, you can ensure that the analysis accurately captures the interaction effect between variables. Once the interaction term is created, you can include it in the fix_formula argument or any other relevant part of the analysis, depending on your specific research question and design.

From: https://github.com/FrederickHuangLin/ANCOMBC

My interpretation and reproducible example with `lmer()`

I use lmer() from lmeTest for testing. I will show how I interpreted the authors explanation using the following example data:

In an experiment observations were made for samples from two treatment groups and from two age groups. Additionally, the experiment was repeated in two years. Thus, the experiment includes the following factors:

"year" (random) with levels "1" and "2"
"treatment" (fixed) with levels "control" and "treated"
"age" (fixed) with levels "old" and "new"

Here is a dummy dataset:

# create example data
set.seed(123)
dat_year1 <- data.frame(year = factor(1),
                        treatment = c(rep("control", 20),
                                      rep("treated", 20)),
                        age = c(rep("old", 10),
                                rep("new", 10),
                                rep("old", 10),
                                rep("new", 10)),
                        value = c(rnorm(n = 10, mean = 10, sd = 2),
                                  rnorm(n = 10, mean = 10, sd = 2),
                                  rnorm(n = 10, mean = 15, sd = 2),
                                  rnorm(n = 10, mean = 16, sd = 2)))

set.seed(321)
dat_year2 <- data.frame(year = factor(2),
                        treatment = c(rep("control", 20),
                                      rep("treated", 20)),
                        age = c(rep("old", 10),
                                rep("new", 10),
                                rep("old", 10),
                                rep("new", 10)),
                        value = c(rnorm(n = 10, mean = 10, sd = 2),
                                  rnorm(n = 10, mean = 10, sd = 2),
                                  rnorm(n = 10, mean = 15, sd = 2),
                                  rnorm(n = 10, mean = 18, sd = 2)))

dat <- rbind(dat_year1, dat_year2)

And a visualization of it:

I am interested in testing the interaction between "treatment" and "age". With lmer() I would do that as follows:

# run lmer()
test <- lmerTest::lmer(value~treatment*age+(1|year), data = dat)
summary(test)

Returning:

Linear mixed model fit by REML. t-tests use
  Satterthwaite's method [lmerModLmerTest]
Formula: value ~ treatment * age + (1 | year)
   Data: dat

REML criterion at convergence: 322.2

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.41222 -0.69075  0.07493  0.44019  2.22631 

Random effects:
 Groups   Name        Variance Std.Dev.
 year     (Intercept) 0.05503  0.2346  
 Residual             3.44757  1.8568  
Number of obs: 80, groups:  year, 2

Fixed effects:
                        Estimate Std. Error      df t value
(Intercept)              10.6492     0.4471  7.6713  23.819
treatmenttreated          6.6717     0.5872 75.0000  11.363
ageold                   -0.4259     0.5872 75.0000  -0.725
treatmenttreated:ageold  -2.6641     0.8304 75.0000  -3.208
                        Pr(>|t|)    
(Intercept)             1.81e-08 ***
treatmenttreated         < 2e-16 ***
ageold                   0.47044    
treatmenttreated:ageold  0.00196 ** 
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) trtmnt ageold
tretmnttrtd -0.657              
ageold      -0.657  0.500       
trtmnttrtd:  0.464 -0.707 -0.707

I interpreted the recommendation to create the interaction term as meaning that a new factor should be manually created that combines the levels of the factors for which an interaction is to be tested:

dat$treatment_age <- paste(dat$treatment, "_", dat$age, sep = "")

Now I included this in the model formula as another fixed factor:

test_manual <- lmerTest::lmer(value~treatment+age+treatment_age+(1|year),
                              data = dat)
summary(test_manual)

Returning:

Linear mixed model fit by REML. t-tests use
  Satterthwaite's method [lmerModLmerTest]
Formula: 
value ~ treatment + age + treatment_age + (1 | year)
   Data: dat

REML criterion at convergence: 322.2

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.41222 -0.69075  0.07493  0.44019  2.22631 

Random effects:
 Groups   Name        Variance Std.Dev.
 year     (Intercept) 0.05503  0.2346  
 Residual             3.44757  1.8568  
Number of obs: 80, groups:  year, 2

Fixed effects:
                         Estimate Std. Error      df
(Intercept)               10.6492     0.4471  7.6713
treatmenttreated           6.6717     0.5872 75.0000
ageold                    -3.0900     0.5872 75.0000
treatment_agecontrol_old   2.6641     0.8304 75.0000
                         t value Pr(>|t|)    
(Intercept)               23.819 1.81e-08 ***
treatmenttreated          11.363  < 2e-16 ***
ageold                    -5.263 1.30e-06 ***
treatment_agecontrol_old   3.208  0.00196 ** 
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) trtmnt ageold
tretmnttrtd -0.657              
ageold       0.000 -0.500       
trtmnt_gcn_ -0.464  0.707 -0.707
fit warnings:
fixed-effect model matrix is rank deficient so dropping 2 columns / coefficients

For "treatment" all statistics are identical, for "age" they are much different, and for "treatment:age" the estimate and t-value are now negative. Additionally, there is a fit warning "fixed-effect model matrix is rank deficient so dropping 2 columns / coefficients".

Question

How can I improve this to correctly include the interaction term manually?

Use with `ancombc2`

When I try to use ancombc2() with the manually included interaction term it does not execute but returns an error:

library(ANCOMBC)
library(phyloseq)

ancombc2(data = phyloseq_object,
         tax_level = NULL,
         fix_formula = "treatment+age+treatment_age",
         verbose = TRUE)

Returning:

Obtaining initial estimates ...

Error: Estimation failed for the following covariates:
treated_old, treated_new
Please ensure that these covariates do not have missing values and check for multicollinearity before re-estimating the model

Solution

In your short version, factorC on it's own is equivalent to factorA*factorB - you don't need to include factorA+factorB. That's why you get the rank deficiency warning in your longer version, two of the levels in treatment_age are already covered by the treatment and age main effects.

lmerTest::lmer: Creating an interaction term manually

Short version

Background

My interpretation and reproducible example with lmer()

Question

Use with ancombc2

My interpretation and reproducible example with `lmer()`

Use with `ancombc2`