I have a series of data frames, each holding the data for one linear model. I want to automatically drop columns from each data frame using a VIF (variance inflation factor) threshold of 10. A given data frame looks like this:
df_nn <- structure(list(capital = c(100, 101, 102, 103,
104, 105, 106, 107, 108, 109,
110, 111, 112, 113, 114, 115,
116, 117, 118, 119, 120, 121,
122, 123, 124, 125, 126, 127,
128, 129, 130, 131, 132), IVAE = c(109.19,
110.09, 111.84, 112.49, 111.99, 113.11, 111.89, 112.11, 112.75,
113.7, 112.93, 112.43, 114.88, 114.5, 114.93, 115.13, 105.54,
91.71, 87.93, 93.06, 96.74, 103.26, 106.76, 109.6, 110.74, 112,
112.73, 114.97, 115.01, 114.67, 115.78, 114.52, 111.91), `Índice de Producción Industrial (IPI): Industrias Manufactureras, Explotación de Minas y Canteras y Otras Actividades Industriales` = c(101.4,
103.4, 106.72, 108.45, 107.76, 107.25, 105.75, 107.03, 107.31,
106.61, 106.95, 106.61, 110.18, 108.68, 109.66, 111.32, 100.02,
76.77, 73.46, 81.99, 94.83, 100.64, 104.51, 106.74, 107.04, 108.75,
110.8, 110.59, 111.25, 108.82, 110.03, 111.32, 107.61), Construcción = c(112.25,
117.5, 124.32, 122.64, 121.21, 128.69, 122.28, 126.55, 120.13,
137.47, 129.82, 126.83, 132.92, 131.72, 137.56, 130.89, 117.08,
87.62, 67.49, 79.56, 88.97, 117.57, 110.01, 118.02, 117.61, 121.64,
120.76, 120.99, 118.96, 122.7, 122.59, 101.2, 106.3), `Comercio, Transporte y Almacenamiento, Actividades de Alojamiento y de Servicio de Comidas` = c(112.2,
113.03, 115.69, 113.74, 114.7, 115.93, 115.3, 114.25, 115.05,
116.68, 114.84, 114.56, 116.58, 117.77, 119.19, 119.15, 103.41,
76.66, 75.21, 90.32, 91.72, 97.53, 105.21, 110.43, 109.72, 112.41,
114.05, 115.88, 117.29, 115.05, 114.69, 116.79, 109.68), `Actividades Inmobiliarias` = c(113.31,
113.83, 114.69, 114.97, 115.98, 116.2, 116.22, 115.64, 115.79,
115.95, 116.24, 117.6, 117.84, 115.35, 108.98, 105.89, 103.74,
103.16, 102.5, 102.42, 102.41, 104.16, 107.74, 112.87, 116.57,
115.68, 113.47, 112.41, 112.08, 112.42, 112.74, 113.21, 112.56
), `Actividades Profesionales, Científicas, Técnicas, Administrativas, de Apoyo y Otros Servicios` = c(111.84,
111.92, 116.44, 117.77, 112.96, 114.64, 113.67, 112.33, 115.12,
113.31, 114.14, 115.46, 117.17, 120.57, 124.26, 122.68, 99.51,
86.36, 79.21, 81.56, 83.6, 88.71, 97.76, 98.16, 101.04, 102.68,
108.37, 113.64, 114.82, 115.91, 118.35, 118.74, 109.14), empleo = c(851413,
856079, 853309, 854541, 856040, 853881, 853328, 858454, 860200,
861430, 865033, 867569, 874276, 870793, 872645, 876928, 873733,
840029, 813159, 805474, 808920, 814118, 824284, 833293, 841311,
842072, 848832, 854290, 859130, 860833, 865704, 873081, 881033
)), row.names = c(NA, -33L), class = c("tbl_df", "tbl", "data.frame"
))
Where "capital" is the dependent variable and the remaining columns are the independent variables, all of them numeric.
So far, I have tried the following function for a single data frame:
library(car)
vif_fun <- function(df){
  while(TRUE) {
    vifs <- vif(lm(capital ~. , data = df))
    if (max(vifs) < 10) {
      break
    }
    highest <- c(names((which(vifs == max(vifs)))))
    df <- df[,-which(names(df) %in% highest)]
  }
  return(df)
}
vif_fun(df_nn)
As long as there is a variable with a VIF above 10, the function should drop the variable with the highest VIF and refit the model on the remaining columns.
However, whenever I run the function, I get the following error message:
Error in terms.formula(formula, data = data) :
'.' in formula and no 'data' argument
I tried the function with the mtcars data set, replacing "capital" with "mpg" in the function, and it worked. Any ideas about what might be going on?
An easier option is to use clean_names() from janitor, which replaces the non-standard column names (the ones with spaces, commas and accents) with syntactic ones:
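library(car)

vif_fun <- function(df) {
  # Convert the column names to syntactic snake_case names first
  df <- janitor::clean_names(df)
  while (TRUE) {
    vifs <- vif(lm(capital ~ ., data = df))
    # Stop once every remaining predictor is below the threshold
    if (max(vifs) < 10) {
      break
    }
    # Drop the predictor(s) with the highest VIF, then refit
    highest <- names(which(vifs == max(vifs)))
    df <- df[, -which(names(df) %in% highest)]
  }
  return(df)
}
vif_fun(df_nn)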
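If you want to see which columns clean_names() would have to touch, a quick base R check is to compare the names against the output of make.names(). This is only a minimal sketch: `nms` is a hypothetical short vector standing in for names(df_nn), and `non_syntactic` is a name I made up for the result.

```r
# Hypothetical subset of the column names in df_nn (an assumption for illustration)
nms <- c("capital", "IVAE", "Actividades Inmobiliarias", "empleo")

# make.names() returns syntactically valid versions of the names;
# any name it changes is non-syntactic (here, because of the space)
non_syntactic <- nms[nms != make.names(nms)]
print(non_syntactic)
```

On the real data you would pass names(df_nn) instead of `nms`. Any name this returns is one that needs backticks when used in a formula.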