Search code examples
rregressionextractsimulationlogistic-regression

Simulating logistic regression from saved estimates in R


I have a bit of an issue. I am trying to develop some code that will allow me to do the following: 1) run a logistic regression analysis, 2) extract the estimates from the logistic regression analysis, and 3) use those estimates to create another logistic regression formula that I can use in a subsequent simulation of the original model. As I am, relatively new to R, I understand I can extract these coefficients 1-by-1 through indexing, but it is difficult to "scale" this to models with different numbers of coefficients. I am wondering if there is a better way to extract the coefficients and setup the formula. Then, I would have to develop the actual variables, but the development of these variables would have to be flexible enough for any number of variables and distributions. This appears to be easily done in Mplus (example 12.7 in the Mplus manual), but I haven't figured this out in R. Here is the code for as far as I have gotten:

#generating the data
set.seed(1)
gender <- sample(c(0,1), size = 100, replace = TRUE)
age <- round(runif(100, 18, 80))
xb <- -9 + 3.5*gender + 0.2*age
p <- 1/(1 + exp(-xb))
y <- rbinom(n = 100, size = 1, prob = p)

#grabbing the coefficients from the logistic regression model
matrix_coef <- summary(glm(y ~ gender + age, family = "binomial"))$coefficients

the_estimates <- matrix_coef[,1]
the_estimates

the_estimates[1]
the_estimates[2]
the_estimates[3]

I just cannot seem to figure out how to have R create the formula with the variables (x's) and the coefficients from the original model in a flexible manner to accommodate any number of variables and different distributions. This is not class assignment, but a necessary piece for the research that I am producing. Any help will be greatly appreciated, and please, treat this as a teaching moment. I really want to learn this.


Solution

  • I'm not 100% sure what your question is here.

    If you want to simulate new data from the same model with the same predictor variables, you can use the simulate() method:

    dd <- data.frame(y, gender, age)
    ## best practice when modeling in R: take the variables from a data frame
    model <- glm(y ~ gender + age, data = dd, family = "binomial")
    simulate(model)
    

    You can create multiple replicates by specifying the nsim= argument (or you can simulate anew every time through a for() loop)

    If you want to simulate new data from a different set of predictor variables, you have to do a little bit more work (some model types in R have a newdata= argument, but not GLMs alas):

    ## simulate new model matrix (including intercept)
    simdat <- cbind(1,
                    gender = rbinom(100, prob = 0.5, size = 1),
                    age = sample(18:80, size = 100, replace = TRUE))
    ## extract inverse-link function
    invlink <- family(model)$linkinv 
    ## sample new values
    resp <- rbinom(n = 100, size = 1, prob = invlink(simdat %*% coef(model)))
    

    If you want to do this later from coefficients that have been stored, substitute the retrieved coefficient vector for coef(model) in the code above.

    If you want to flexibly construct formulas, reformulate() is your friend — but I don't see how it fits in here.


    If you want to (say) re-fit the model 1000 times to new responses simulated from the original model fit (same coefficients, same predictors: i.e. a parametric bootstrap), you can do something like this.

    nsim <- 1000
    res <- matrix(NA, ncol = length(coef(model)), nrow = nsim)
    for (i in 1:nsim) {
       ## simulate returns a list (in this case, of length 1);
       ## extract the response vector 
       newresp <- simulate(model)[[1]]
       newfit <- update(model, newresp ~ .)
       res[i,] <- coef(newfit)
    }
    

    You don't have to store coefficients - you can extract/compute whatever model summaries you like (change the number of columns of res appropriately).