Check that model has only one factor covariate

I am writing an R package, where the main function takes a model, which may only have a single factor covariate (offsets are allowed). To make sure the user complies with this rule I need to check this.

As an example, let's look at the following four models:

set.seed(123)
n <- 10 

## data
data <- data.frame(y = rnorm(n), 
  trt = rep(c(0, 1), each = n/2),
  x = 1:n)
datan <- data
datan$trt <- as.factor(datan$trt)

## models
mod1 <- lm(y ~ factor(trt), data = data)
mod2 <- lm(y ~ offset(x) + as.factor(trt), data = data)
mod3 <- lm(y ~ trt, data = datan)
mod4 <- glm(y ~ trt + offset(x), data = data)
mod5 <- lm(y ~ x + as.factor(trt), data = data)

Models 1, 2 and 3 are ok, models 4 and 5 are not ok (model 4 has a non-factor variable trt, model 5 has a second covariate x).

How do I check this using R? Optimally I'd get a TRUE for a model that's ok, and a FALSE for a problematic model.

This should work not only with lm() and glm(), but also with survreg() and coxph() (from package survival). Something that might be useful is looking at the formula eval(getCall(mod1)$formula) and the data (data/ datan).

Solution

As indicated in the previous reply by @LAP you can use the terms() from these models. However, I would recommend to look at the attr(..., "factors") and attr(..., "dataClasses") instead of going to the $model which requires that the entire model.frame() is stored in the model. This may or may not be the case. Specifically, when re-fitting multiple models you might want to be able to not store the model frame each time.

So one idea would be to proceed in the following steps:

Check whether attr(..., "factors") has not exactly one column, the you can return FALSE.
If there is exactly one factor, you can check the corresponding attr(..., "dataClasses") if it is "factor"/"ordered" and then return TRUE, otherwise FALSE.

R code:

one_factor <- function(object) {
  f <- attr(terms(object), "factors")
  if(length(f) == 0L || NCOL(f) != 1L) return(FALSE)
  d <- attr(terms(object), "dataClasses")
  if(d[colnames(f)] %in% c("ordered", "factor")) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

This appears to work well for single-part formula-based objects.

Dummy data with numeric/factor/ordered trt:

d1 <- d2 <- d3 <- data.frame(y = log(1:9), x = 1:9, trt = rep(1:3, each = 3)) 
d2$trt <- factor(d2$trt)
d3$trt <- ordered(d3$trt)

Various formula specifications:

f <- list(
  y ~ 1,
  y ~ x,
  y ~ trt,
  y ~ trt + x,
  y ~ trt + offset(x),
  y ~ trt + x + offset(x),
  y ~ trt + offset(as.numeric(trt)),
  y ~ factor(trt),
  y ~ factor(trt) + offset(x),
  y ~ factor(x > as.numeric(trt)),
  y ~ interaction(x, trt),
  y ~ 0 + trt
)

Expected results for d1, d2, and d3, respectively:

ok1 <- c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)
ok2 <- c(FALSE, FALSE, TRUE,  FALSE, TRUE,  FALSE, TRUE,  TRUE, TRUE, TRUE, TRUE, TRUE)
ok3 <- ok2

Checks for lm without storing the model frame:

lm1 <- lapply(f, lm, data = d1, model = FALSE)
identical(sapply(lm1, one_factor), ok1)
## [1] TRUE
lm2 <- lapply(f, lm, data = d2, model = FALSE)
identical(sapply(lm2, one_factor), ok2)
## [1] TRUE
lm3 <- lapply(f, lm, data = d3, model = FALSE)
identical(sapply(lm3, one_factor), ok3)
## [1] TRUE

Checks for survreg (Gaussian) and coxph. (The latter throws a lot of warnings about non-convergence which is not surprising given the dummy data structure. The checks still work as intended.)

library("survival")
d1$y <- d2$y <- d3$y <- Surv(d1$y + 0.5)

sr1 <- lapply(f, survreg, data = d1)
identical(sapply(sr1, one_factor), ok1)
## [1] TRUE
sr2 <- lapply(f, survreg, data = d2)
identical(sapply(sr2, one_factor), ok2)
## [1] TRUE
sr3 <- lapply(f, survreg, data = d3)
identical(sapply(sr3, one_factor), ok3)
## [1] TRUE

cph1 <- lapply(f, coxph, data = d1)
identical(sapply(cph1, one_factor), ok1)
## [1] TRUE
cph2 <- lapply(f, coxph, data = d2)
identical(sapply(cph2, one_factor), ok2)
## [1] TRUE
cph3 <- lapply(f, coxph, data = d3)
identical(sapply(cph3, one_factor), ok3)
## [1] TRUE

Note: If you have multi-part Formula-based objects this function might fail and the underlying tests would need to be adapted. Examples for the latter could include count regression models (zeroinfl, hurdle), multinomial logit (mlogit), instrumental variables (ivreg), heteroscedastic models (vglm, betareg, crch) etc. These might have formulas like y ~ trt | 1 or y ~ trt | trt or y ~ trt | x which may or may not still be feasible in your framework.