I'm trying to run a backward stepwise regression in R, but I might not know how to accomplish what I want.
This is an example of what I want to do: Using the Iris dataset, I have a simple regression with fixed effects for Species:
lm(Petal.Length ~ Petal.Width + Sepal.Width + Sepal.Length + as.factor(Species), data = iris)
I want to run something similar to a backward stepwise regression, but instead of removing variables, I want to remove categories (subsampling) from Species until my variables stop being significant.
How could I do that?
I achieved the results I wanted with a couple of loops, but I hoped for a more efficient or simpler solution:
# Install and load necessary packages
pacman::p_load(dplyr, broom, writexl, beepr)
# Assuming iris is the dataframe
# Model: lm(Petal.Length ~ Petal.Width + Sepal.Width + Sepal.Length + as.factor(Species), data = iris)
# Define a function to run the regression model and return the results
run_model <- function(data, formula) {
model <- lm(formula, data = data)
# Extract model summary
summary_model <- summary(model)
# Extract coefficients, p-values, and other details
coefficients <- summary_model$coefficients
# Return a list with AIC and coefficients
return(list(coefficients = coefficients))
}
# Store initial model results for the full model
full_model_results <- run_model(iris,
Petal.Length ~ Petal.Width + Sepal.Width + Sepal.Length + as.factor(Species))
# Initialize a list to store results
results <- list(full_model = full_model_results)
# Get the unique levels of Species
species <- unique(iris$Species)
# Loop to drop one cat at a time
for (i in 0:(length(species)-1)) {
# Subset data to exclude the current set of species
subset_data <- dplyr::filter(iris, !Species %in% species[1:i])
# Check if there are at least two levels of Species left
if (length(unique(subset_data$Species)) < 2) {
next
}
# Run the regression model on the subset data
current_results <- run_model(subset_data,
Petal.Length ~ Petal.Width + Sepal.Width + Sepal.Length + as.factor(Species))
# Store the results with a descriptive name
results[[paste0("drop_", i, "_species")]] <- current_results
}
# Format and display the results in a more readable format
formatted_results <- lapply(results, function(res) {
list(
Coefficients = as.data.frame(res$coefficients)
)
})