Search code examples
rmodelregressionbuilding

Manually build logistic regression model for prediction in R


I'm attempting to test a logistic regression model (e.g. 3 coefficients for 3 predictor variables, X1,X2,X3), on a dataset. I'm aware of how to test a model after i created the model object using, for example,

mymodel <- glm( Outcome ~  X1 + X2 + X3 , family = binomial,data=trainDat)

and then test the data

prob <- predict(mymodel,type="response",newdata=test)

But i want to, now, create a logistic model using coefficients and intercept that I have, and then test this model on data.

Basically I'm not clear on how to create "mymodel" without running glm.

Context for the question: I've run a logistic regression using offsets eg:

mymodel <- glm(Outcome ~ offset(C1 * X1) + offset(C2 * X2) + X3, 
               family = binomial, data = trainDat)

Thus, the mymodel object generates a model with only an intercept (I) and C3 coefficients (for feature X3).
I now need to test the full model (i.e. I + C1*X1 + C2*X2 + C3*X3), on a test dataset, but I don't know how to get the full model, since the output of mymodel has only intercept and C3. So I thought my more general question was: "how do you manually build a logisitic regression model object?"

Thank you for your patience.


Solution

  • I could not find a simple function to do this. There is some code in the predict function that depends on having a fitted model (like determining the rank of the model). However, we can create a function to make a fake glm object that you can use with predict. Here's my first attempt at such a function

    makeglm <- function(formula, family, data=NULL, ...) {
        dots <- list(...)
        out<-list()
        tt <- terms(formula, data=data)
        if(!is.null(data)) {
            mf <- model.frame(tt, data)
            vn <- sapply(attr(tt, "variables")[-1], deparse)
    
            if((yvar <- attr(tt, "response"))>0)
                vn <- vn[-yvar]
                xlvl <- lapply(data[vn], function(x) if (is.factor(x))
               levels(x)
            else if (is.character(x))
               levels(as.factor(x))
            else
                NULL)
            attr(out, "xlevels") <- xlvl[!vapply(xlvl,is.null,NA)]
            attr(tt, "dataClasses") <- sapply(data[vn], stats:::.MFclass)
        }
        out$terms <- tt
        coef <- numeric(0)
        stopifnot(length(dots)>1 & !is.null(names(dots)))
        for(i in seq_along(dots)) {
            if((n<-names(dots)[i]) != "") {
                v <- dots[[i]]
                if(!is.null(names(v))) {
                    coef[paste0(n, names(v))] <- v
                } else {
                    stopifnot(length(v)==1)
                    coef[n] <- v
                }
            } else {
                coef["(Intercept)"] <- dots[[i]]
            }   
        }
        out$coefficients <- coef
        out$rank <- length(coef)
        out$qr <- list(pivot=seq_len(out$rank))
        out$family <- if (class(family) == "family") {
            family
        } else if (class(family) == "function") {
            family()
        } else {
            stop(paste("invalid family class:", class(family)))
        }
        out$deviance <- 1
        out$null.deviance <- 1
        out$aic <- 1
        class(out) <- c("glm","lm")
        out
    }
    

    So this function creates an object and passes all the values that predict and print expect to find on such an object. Now we can test it out. First, here's some test data

    set.seed(15)
    dd <- data.frame(
        X1=runif(50),
        X2=factor(sample(letters[1:4], 50, replace=T)),
        X3=rpois(50, 5),
        Outcome = sample(0:1, 50, replace=T)
    )
    

    And we can fit a standard binomial model with

    mymodel<-glm(Outcome~X1+X2+X3, data=dd, family=binomial)
    

    Which gives

    Call:  glm(formula = Outcome ~ X1 + X2 + X3, family = binomial, data = dd)
    
    Coefficients:
    (Intercept)           X1          X2b          X2c          X2d           X3  
        -0.4411       0.8853       1.8384       0.9455       1.5059      -0.1818  
    
    Degrees of Freedom: 49 Total (i.e. Null);  44 Residual
    Null Deviance:      68.03 
    Residual Deviance: 62.67    AIC: 74.67
    

    Now let's say we wanted to try out model that we read in a publication on the same data. Here's how we use the makeglm function

    newmodel <- makeglm(Outcome~X1+X2+X3, binomial, data=dd, 
        -.5, X1=1, X2=c(b=1.5, c=1, d=1.5), X3=-.15)
    

    The first parameter is the formula of the model. This defines the response and the covariates just like you would when running glm. Next you specify the family like you would with glm(). And you need to pass a data frame so R can sniff the correct data types for each of the variables involved. This will also identify all of the factor variables and their levels using the data.frame. So this can be new data that's coded just like the fitted data.frame or it can be the original one.

    Now we start to specify the coefficients to use in our model. The coefficients will be filled using the names of the parameters. The unnamed parameter will be used as the intercept. If you have a factor, you need to give coefficients to all the levels via named parameters. Here I just decided to round off the fitted estimates to "nice" numbers.

    Now I can use our newmodel with predict.

    predict(mymodel, type="response")
    #         1         2         3         4         5
    # 0.4866398 0.3553439 0.6564668 0.7819917 0.3008108
    
    predict(newmodel, newdata=dd, type="response")
    
    #         1         2         3         4         5
    # 0.5503572 0.4121811 0.7143200 0.7942776 0.3245525
    

    Here I call predict on the original model and on the new model using the old data with my specified coefficients. We can see the estimate of probability have changed a bit.

    Now I haven't thoroughly tested this function so use at your own risk. I don't do as much error checking as I probably should. Maybe someone else does know of a better way.