Search code examples
rfor-looplogistic-regression

For loop for sequentially adjusted regression in R


Most questions related to traditional loops in R are explained away by using functionals with less code and, generally, being more flexible.

However, please correct me, I feel when the order of iterations are important, for loops would still dominate.

In my case, I would like to build a sequentially and cumulatively adjusted logistic regression model, store the OR/CIs along with a column showing the what are being adjusted for. This is my expected output:

 Model        OR     CI

 Biomarker
 +Age
 +Sex
 +Smoking 

Here's what I did:

df1 <- subset(df, select = c(age_cat, is_female, smoking_category,
                                 bmi_calc, has_diabetes, sbp_mean, 
                                 alcohol_category, highest_education,
                                 occupation, household_income))
model <- data.frame(NULL)

for (i in seq_along(df1)) {

  model <- exp((cbind(OR = coef(glm(as.formula(paste("istroke ~ log2(hscrp_mgl)", i, sep = "+")), 
                         family=binomial, data=df)),
           confint(glm(as.formula(paste("istroke ~ log2(hscrp_mgl)", i, sep = "+")), 
                       family=binomial, data=df)))))


}

My outcome variable is stroke (istroke, 0 or 1). My exposure of interest is the biomarker (hscrp_mgl). I know I'm making a fundamental mistake somewhere. I looked for in other SO posts but most of them don't want sequentially and cumulatively adjusted regression models.

Please let me know if this is a duplicate, nonetheless and if anything's unclear.

EDIT

My original dataset df contains all the variables of df1, my outcome variable and then some. Here is a reproducible sample of it:

age_cat is_female   smoking_category    bmi_calc    has_diabetes        sbp_mean    istroke
(59,69]        0           4            19.6           0                103.5          0
(59,69]        1           1            19.1           0                 138           0
(29,59]        0           4            26.8           0               155.5           0
(29,59]        0           1            23.1           0                 130           1
(29,59]        1           1            22.7           0                 126           1
(59,69]        0           4             25            0               182.5           0
(29,59]        1           1             20            0                  96           1
(29,59]        1           2             23.9          0               134.5           0
(59,69]        0           4             24.4          0               160.5           1

EDIT A more reproducible example:

df <- data.frame(age = c(50, 60, 50, 40, 70, 90, 30),
             gender = c(0, 1, 1, 0, 1, 1, 1),
             smoke = c(4, 3, 2, 1, 4, 3, 4),
             BMI = c(19, 20, 21, 22, 23, 24, 25),
             SBP = c(100, 120, 140, 110, 120, 130, 120),
             diab = c(0, 1, 1, 1, 0, 1, 1),
             stroke = c(0, 1, 0, 0, 1, 1, 1))
dput(df)
structure(list(age = c(50, 60, 50, 40, 70, 90, 30), gender = c(0, 
1, 1, 0, 1, 1, 1), smoke = c(4, 3, 2, 1, 4, 3, 4), BMI = c(19, 
20, 21, 22, 23, 24, 25), SBP = c(100, 120, 140, 110, 120, 130, 
120), diab = c(0, 1, 1, 1, 0, 1, 1), stroke = c(0, 1, 0, 0, 1, 
1, 1)), .Names = c("age", "gender", "smoke", "BMI", "SBP", "diab", 
"stroke"), row.names = c(NA, -7L), class = "data.frame")

Solution

  • Actually, lapply might be a better approach in your case over for as it can return a collection of data.frames for final row bind instead of expanding model iteratively.

    Below example randomizes hscrp_mgl as it was not in posted data. So ignore results but consider process. Additionally, confidence interval is split between low and high in different columns.

    set.seed(456)
    df <- data.frame(hscrp_mgl = abs(rnorm(250)),
                     age = sample(100, 1000, replace=TRUE),
                     gender = sample(0:1, 1000, replace=TRUE),
                     smoke = sample(1:4, 1000, replace=TRUE),
                     BMI = sample(19:25, 1000, replace=TRUE),
                     SBP = sample(c(100, 120, 140, 110, 120, 130, 120),
                                  1000, replace=TRUE),
                     diab = sample(0:1, 1000, replace=TRUE),
                     stroke = sample(0:1, 1000, replace=TRUE))  
    
    # ITERATE THROUGH COLUMN NUMBERS (SUBSETTING OUT FIRST AND LAST)
    modeldfs <- lapply(seq_along(df)[3:ncol(df)-1], function(i) {
      strf <- paste("stroke ~ log2(hscrp_mgl)", 
                    paste(names(df)[2:i], collapse = "+"), sep = "+")
      print(strf)
    
      # FIT DYNAMIC CUMULATIVE FORMULA USING names() TO PASS IN COLUMN NAME
      fit <- glm(as.formula(strf), family=binomial, data=df)
    
      # BIND MODEL STATS
      data.frame(OR = exp(coef(fit)[i+1]), 
                 CI_2.5 = exp(confint(fit)[i+1,1]), 
                 CI_97.5 = exp(confint(fit)[i+1,2]))
    })
    
    model <- do.call(rbind, modeldfs)
    model
    

    Output

    [1] "stroke ~ log2(hscrp_mgl)+age"
    # Waiting for profiling to be done...
    # Waiting for profiling to be done...
    [1] "stroke ~ log2(hscrp_mgl)+age+gender"
    # Waiting for profiling to be done...
    # Waiting for profiling to be done...
    [1] "stroke ~ log2(hscrp_mgl)+age+gender+smoke"
    # Waiting for profiling to be done...
    # Waiting for profiling to be done...
    [1] "stroke ~ log2(hscrp_mgl)+age+gender+smoke+BMI"
    # Waiting for profiling to be done...
    # Waiting for profiling to be done...
    [1] "stroke ~ log2(hscrp_mgl)+age+gender+smoke+BMI+SBP"
    # Waiting for profiling to be done...
    # Waiting for profiling to be done...
    [1] "stroke ~ log2(hscrp_mgl)+age+gender+smoke+BMI+SBP+diab"
    # Waiting for profiling to be done...
    # Waiting for profiling to be done...
    # > model <- do.call(rbind, modeldfs)
    # > model
                 OR    CI_2.5  CI_97.5
    age    1.003285 0.9989043 1.007701
    gender 1.067117 0.8318796 1.369055
    smoke  1.005926 0.9005196 1.123717
    BMI    1.011281 0.9505659 1.075928
    SBP    1.003252 0.9929368 1.013692
    diab   1.139586 0.8880643 1.462925