Search code examples
rlinear-regressionmultivariate-testing

Stepwise Multivariate Linear Regression in RStudio Stalling


My data has 6 independent variables and 5 dependent variables, and I am running stepwise multivariate linear regressions on them. My code removes any independent variables with p-values > 0.05 in the first regressions, runs a regression again, removing independent variables with p-values > 0.05, etc. in a loop until only those independent variables with p-values < 0.05 remain.

The R code works beautifully through the first three dependent variables, so I know it works, but it stalls when it begins to print the final summary of the regression on Dependent Variable 4.

In this sample data, it works through the first dependent variable only, then stalls, illustrated: RStudio Console Output Stall

Sample Data:

df <- structure(list(IndVar1 = c(114156L, 93447L, 113909L, 103983L, 
111403L, 93071L, 115366L, 98578L, 98487L, 105592L, 102727L, 108970L, 
112604L, 103223L, 95176L, 100051L), IndVar2 = c(131937L, 128287L, 
108688L, 94270L, 137057L, 100127L, 130482L, 109484L, 97535L, 
139141L, 122752L, 124523L, 111948L, 92846L, 104659L, 112592L), 
    IndVar3 = c(10967L, 9733L, 9485L, 10251L, 9426L, 6040L, 7413L, 
    6648L, 9862L, 6605L, 9364L, 8749L, 8621L, 10285L, 9700L, 
    8926L), IndVar4 = c(16260L, 13994L, 14936L, 11645L, 15831L, 
    11916L, 11227L, 14199L, 16051L, 15330L, 14205L, 13756L, 14175L, 
    12530L, 15572L, 15048L), IndVar5 = c(9772L, 8682L, 6849L, 
    10667L, 9223L, 7591L, 10003L, 7660L, 6431L, 8149L, 9111L, 
    10551L, 8448L, 10296L, 6291L, 10131L), IndVar6 = c(12643L, 
    15848L, 11522L, 11114L, 15946L, 16181L, 13360L, 11343L, 12682L, 
    13412L, 16175L, 11187L, 16494L, 14926L, 14569L, 12954L), 
    DepVar1 = c(1674605L, 1731694L, 1910052L, 1955383L, 1698939L, 
    1563494L, 1573640L, 1836676L, 1726679L, 1768668L, 1679431L, 
    1890720L, 1837590L, 1677332L, 1794722L, 1715671L), DepVar2 = c(12721L, 
    12169L, 9924L, 4358L, 9992L, 5290L, 11456L, 10956L, 7254L, 
    4089L, 12761L, 10360L, 4999L, 5387L, 8468L, 7446L), DepVar3 = c(2.55, 
    1.73, 1.08, 0.93, 1.88, 1.77, 2.62, 2.13, 2.32, 1.68, 0.99, 
    1.84, 1.4, 1.34, 0.93, 2), DepVar4 = c(42820L, 28394L, 38820L, 
    50756L, 30729L, 39528L, 30783L, 39551L, 38429L, 46858L, 29727L, 
    52171L, 37714L, 30164L, 43782L, 43410L), DepVar5 = c(2.7, 
    2.59, 1.93, 1.48, 2.03, 1.81, 1.63, 1.53, 2.7, 1.87, 2.33, 
    2.15, 2.09, 2.35, 2.48, 1.54)), class = "data.frame", row.names = c(NA, 
-16L))

I have run all of the stepwise regressions in the very painful/manual regression analysis in Excel and get good results for all dependent variables.

This is the code:

# Function to perform stepwise regression
perform_stepwise_regression <- function(dependent_variable, independent_variables, data) {
  cat("======================================================================\n")
  cat("Dependent Variable: ", dependent_variable, "\n")
  cat("======================================================================\n")
  
  # Fit initial model
  lm_model <- lm(as.formula(paste(dependent_variable, "~", paste(independent_variables, collapse = " + "))), data = df)
  
   while (TRUE) {
    # Check p-values
    p_values <- summary(lm_model)$coefficients[, 4]
    
    # Identify variables with the highest p-value > 0.05
    variable_to_remove <- names(p_values[p_values == max(p_values)])
    
    # Break the loop if no variable has a p-value > 0.05
    if (max(p_values) <= 0.05) {
      break
    }
    
    # Create a new formula excluding the variable
    predictors_to_include <- setdiff(names(lm_model$model)[-1], variable_to_remove)
    
    new_formula <- as.formula(paste(dependent_variable, "~", paste(predictors_to_include, collapse = " + ")))
    
    # Fit a new model
    lm_model <- lm(new_formula, data = df)
  }
  
  # Create the final formula using the single remaining independent variable
  final_formula <- as.formula(paste(dependent_variable, "~", predictors_to_include))
  
  # Fit the final model using the created formula
  final_model <- lm(final_formula, data = df)
  
 # Print the final summary in the console
  print(summary(lm_model))
}

# Example usage:
dependent_variables <- c("DepVar1", "DepVar2", "DepVar3", "DepVar4", "DepVar5")
independent_variables <- c("IndVar1", "IndVar2", "IndVar3", "IndVar4", "IndVar5", "IndVar6")

for (dependent_variable in dependent_variables) {
  perform_stepwise_regression(dependent_variable, independent_variables, df)
}

Google Gemini is worthless here, and ChatGPT is playing me. I need human intervention. Does anyone know what might be going on?


Solution

  • For the regression of DepVar2, the least significant term in the initial model is the intercept. When you create the new formula, you are then trying to remove "(Intercept)" from a vector that does not include "(Intercept)", which just returns the same vector. Therefore, the next iteration uses the exact same independent variables, and you are stuck in an infinite loop. It works fine for DepVar1 because the intercept is significant.

    You could fix this by dropping the intercept from p_values, e.g., using p_values[!(names(p_values)=='(Intercept)')]. I'm assuming you want to keep the intercept in the model even if it's not significant.