I'm attempting to test a logistic regression model (e.g. 3 coefficients for 3 predictor variables, X1,X2,X3), on a dataset. I'm aware of how to test a model after i created the model object using, for example,
mymodel <- glm( Outcome ~ X1 + X2 + X3 , family = binomial,data=trainDat)
and then test the data
prob <- predict(mymodel,type="response",newdata=test)
But i want to, now, create a logistic model using coefficients and intercept that I have, and then test this model on data.
Basically I'm not clear on how to create "mymodel" without running glm.
Context for the question: I've run a logistic regression using offsets eg:
mymodel <- glm(Outcome ~ offset(C1 * X1) + offset(C2 * X2) + X3,
family = binomial, data = trainDat)
Thus, the mymodel object generates a model with only an intercept (I) and C3 coefficients (for feature X3).
I now need to test the full model (i.e. I + C1*X1 + C2*X2 + C3*X3), on a test dataset, but I don't know how to get the full model, since the output of mymodel has only intercept and C3. So I thought my more general question was: "how do you manually build a logisitic regression model object?"
Thank you for your patience.
I could not find a simple function to do this. There is some code in the predict
function that depends on having a fitted model (like determining the rank of the model). However, we can create a function to make a fake glm object that you can use with predict. Here's my first attempt at such a function
makeglm <- function(formula, family, data=NULL, ...) {
dots <- list(...)
out<-list()
tt <- terms(formula, data=data)
if(!is.null(data)) {
mf <- model.frame(tt, data)
vn <- sapply(attr(tt, "variables")[-1], deparse)
if((yvar <- attr(tt, "response"))>0)
vn <- vn[-yvar]
xlvl <- lapply(data[vn], function(x) if (is.factor(x))
levels(x)
else if (is.character(x))
levels(as.factor(x))
else
NULL)
attr(out, "xlevels") <- xlvl[!vapply(xlvl,is.null,NA)]
attr(tt, "dataClasses") <- sapply(data[vn], stats:::.MFclass)
}
out$terms <- tt
coef <- numeric(0)
stopifnot(length(dots)>1 & !is.null(names(dots)))
for(i in seq_along(dots)) {
if((n<-names(dots)[i]) != "") {
v <- dots[[i]]
if(!is.null(names(v))) {
coef[paste0(n, names(v))] <- v
} else {
stopifnot(length(v)==1)
coef[n] <- v
}
} else {
coef["(Intercept)"] <- dots[[i]]
}
}
out$coefficients <- coef
out$rank <- length(coef)
out$qr <- list(pivot=seq_len(out$rank))
out$family <- if (class(family) == "family") {
family
} else if (class(family) == "function") {
family()
} else {
stop(paste("invalid family class:", class(family)))
}
out$deviance <- 1
out$null.deviance <- 1
out$aic <- 1
class(out) <- c("glm","lm")
out
}
So this function creates an object and passes all the values that predict
and print
expect to find on such an object. Now we can test it out. First, here's some test data
set.seed(15)
dd <- data.frame(
X1=runif(50),
X2=factor(sample(letters[1:4], 50, replace=T)),
X3=rpois(50, 5),
Outcome = sample(0:1, 50, replace=T)
)
And we can fit a standard binomial model with
mymodel<-glm(Outcome~X1+X2+X3, data=dd, family=binomial)
Which gives
Call: glm(formula = Outcome ~ X1 + X2 + X3, family = binomial, data = dd)
Coefficients:
(Intercept) X1 X2b X2c X2d X3
-0.4411 0.8853 1.8384 0.9455 1.5059 -0.1818
Degrees of Freedom: 49 Total (i.e. Null); 44 Residual
Null Deviance: 68.03
Residual Deviance: 62.67 AIC: 74.67
Now let's say we wanted to try out model that we read in a publication on the same data. Here's how we use the makeglm
function
newmodel <- makeglm(Outcome~X1+X2+X3, binomial, data=dd,
-.5, X1=1, X2=c(b=1.5, c=1, d=1.5), X3=-.15)
The first parameter is the formula of the model. This defines the response and the covariates just like you would when running glm
. Next you specify the family like you would with glm()
. And you need to pass a data frame so R can sniff the correct data types for each of the variables involved. This will also identify all of the factor variables and their levels using the data.frame. So this can be new data that's coded just like the fitted data.frame or it can be the original one.
Now we start to specify the coefficients to use in our model. The coefficients will be filled using the names of the parameters. The unnamed parameter will be used as the intercept. If you have a factor, you need to give coefficients to all the levels via named parameters. Here I just decided to round off the fitted estimates to "nice" numbers.
Now I can use our newmodel
with predict.
predict(mymodel, type="response")
# 1 2 3 4 5
# 0.4866398 0.3553439 0.6564668 0.7819917 0.3008108
predict(newmodel, newdata=dd, type="response")
# 1 2 3 4 5
# 0.5503572 0.4121811 0.7143200 0.7942776 0.3245525
Here I call predict on the original model and on the new model using the old data with my specified coefficients. We can see the estimate of probability have changed a bit.
Now I haven't thoroughly tested this function so use at your own risk. I don't do as much error checking as I probably should. Maybe someone else does know of a better way.