I am writing an R package, where the main function takes a model, which may only have a single factor covariate (offsets are allowed). To make sure the user complies with this rule I need to check this.
As an example, let's look at the following four models:
set.seed(123)
n <- 10
## data
data <- data.frame(y = rnorm(n),
trt = rep(c(0, 1), each = n/2),
x = 1:n)
datan <- data
datan$trt <- as.factor(datan$trt)
## models
mod1 <- lm(y ~ factor(trt), data = data)
mod2 <- lm(y ~ offset(x) + as.factor(trt), data = data)
mod3 <- lm(y ~ trt, data = datan)
mod4 <- glm(y ~ trt + offset(x), data = data)
mod5 <- lm(y ~ x + as.factor(trt), data = data)
Models 1, 2 and 3 are ok, models 4 and 5 are not ok (model 4 has a non-factor variable trt
, model 5 has a second covariate x
).
How do I check this using R? Optimally I'd get a TRUE
for a model that's ok, and a FALSE
for a problematic model.
This should work not only with lm()
and glm()
, but also with survreg()
and coxph()
(from package survival). Something that might be useful is looking at the formula eval(getCall(mod1)$formula)
and the data (data
/ datan
).
As indicated in the previous reply by @LAP you can use the terms()
from these models. However, I would recommend to look at the attr(..., "factors")
and attr(..., "dataClasses")
instead of going to the $model
which requires that the entire model.frame()
is stored in the model. This may or may not be the case. Specifically, when re-fitting multiple models you might want to be able to not store the model frame each time.
So one idea would be to proceed in the following steps:
attr(..., "factors")
has not exactly one column, the you can return FALSE
.attr(..., "dataClasses")
if it is "factor"
/"ordered"
and then return TRUE
, otherwise FALSE
.R code:
one_factor <- function(object) {
f <- attr(terms(object), "factors")
if(length(f) == 0L || NCOL(f) != 1L) return(FALSE)
d <- attr(terms(object), "dataClasses")
if(d[colnames(f)] %in% c("ordered", "factor")) {
return(TRUE)
} else {
return(FALSE)
}
}
This appears to work well for single-part formula
-based objects.
Dummy data with numeric/factor/ordered trt
:
d1 <- d2 <- d3 <- data.frame(y = log(1:9), x = 1:9, trt = rep(1:3, each = 3))
d2$trt <- factor(d2$trt)
d3$trt <- ordered(d3$trt)
Various formula specifications:
f <- list(
y ~ 1,
y ~ x,
y ~ trt,
y ~ trt + x,
y ~ trt + offset(x),
y ~ trt + x + offset(x),
y ~ trt + offset(as.numeric(trt)),
y ~ factor(trt),
y ~ factor(trt) + offset(x),
y ~ factor(x > as.numeric(trt)),
y ~ interaction(x, trt),
y ~ 0 + trt
)
Expected results for d1
, d2
, and d3
, respectively:
ok1 <- c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)
ok2 <- c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)
ok3 <- ok2
Checks for lm
without storing the model frame:
lm1 <- lapply(f, lm, data = d1, model = FALSE)
identical(sapply(lm1, one_factor), ok1)
## [1] TRUE
lm2 <- lapply(f, lm, data = d2, model = FALSE)
identical(sapply(lm2, one_factor), ok2)
## [1] TRUE
lm3 <- lapply(f, lm, data = d3, model = FALSE)
identical(sapply(lm3, one_factor), ok3)
## [1] TRUE
Checks for survreg
(Gaussian) and coxph
. (The latter throws a lot of warnings about non-convergence which is not surprising given the dummy data structure. The checks still work as intended.)
library("survival")
d1$y <- d2$y <- d3$y <- Surv(d1$y + 0.5)
sr1 <- lapply(f, survreg, data = d1)
identical(sapply(sr1, one_factor), ok1)
## [1] TRUE
sr2 <- lapply(f, survreg, data = d2)
identical(sapply(sr2, one_factor), ok2)
## [1] TRUE
sr3 <- lapply(f, survreg, data = d3)
identical(sapply(sr3, one_factor), ok3)
## [1] TRUE
cph1 <- lapply(f, coxph, data = d1)
identical(sapply(cph1, one_factor), ok1)
## [1] TRUE
cph2 <- lapply(f, coxph, data = d2)
identical(sapply(cph2, one_factor), ok2)
## [1] TRUE
cph3 <- lapply(f, coxph, data = d3)
identical(sapply(cph3, one_factor), ok3)
## [1] TRUE
Note: If you have multi-part Formula
-based objects this function might fail and the underlying tests would need to be adapted. Examples for the latter could include count regression models (zeroinfl
, hurdle
), multinomial logit (mlogit
), instrumental variables (ivreg
), heteroscedastic models (vglm
, betareg
, crch
) etc. These might have formulas like y ~ trt | 1
or y ~ trt | trt
or y ~ trt | x
which may or may not still be feasible in your framework.