I have a dataset `rf1` with 845 features and 1052 rows, and I want to remove the highly correlated features before doing ML. I wrote the code below, but it only prints the correlated feature pairs and their correlations without eliminating them.
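```r
library(dplyr)  # provides %>% and mutate_if()

corr_simple <- function(rf1, sig = 0.9) {
  # Convert character columns to factors, then factors to numeric codes
  df_cor <- rf1 %>% mutate_if(is.character, as.factor)
  df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)
  corr <- cor(df_cor)
  # Keep only the upper triangle so each pair appears once
  corr[lower.tri(corr, diag = TRUE)] <- NA
  corr[corr == 1] <- NA
  # Long format: one row per (Var1, Var2, Freq) pair
  corr <- as.data.frame(as.table(corr))
  corr <- na.omit(corr)
  corr <- subset(corr, abs(Freq) > sig)
  corr <- corr[order(-abs(corr$Freq)), ]
  print(corr)
  mtx_corr <- reshape2::acast(corr, Var1 ~ Var2, value.var = "Freq")
}
corr_simple(rf1)
```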
Here is the result (MY RESULTS), but I want to eliminate the variables using a threshold of 0.9.
When I use functions I found here, such as this one, I get the following error message:
```r
data <- data.frame(rf1)
cor_matrix <- cor(data)
cor_matrix_rm <- cor_matrix
cor_matrix_rm[upper.tri(cor_matrix_rm)] <- 0
diag(cor_matrix_rm) <- 0
cor_matrix_rm
data_new <- data[ , !apply(cor_matrix_rm, 2, function(x) any(x > 0.90))]
#> Error in `[.data.frame`(data, , !apply(cor_matrix_rm, 2, function(x) any(x > :
#>   undefined columns selected
```
I searched and tried other solutions, but I always run into this problem.
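One possible explanation for the "undefined columns selected" error, offered as a guess because `rf1` itself is not shown: if any column is constant, `cor()` returns `NA` for it, `any(x > 0.90)` then evaluates to `NA`, and an `NA` in a logical column index makes `[.data.frame` fail with this message. The sketch below reproduces that on a made-up data frame (`toy` and its values are invented for illustration) and shows a variant using `abs()` and `na.rm = TRUE`. Whether this is what happens with `rf1` is an assumption.

```r
# Toy data (made up for illustration): `b` tracks `a`, `c` is constant.
toy <- data.frame(a = c(1, 2, 3, 4), b = c(2, 4, 6, 8.5), c = c(5, 5, 5, 5))

# cor() warns about the zero standard deviation and returns NA for `c`.
cm <- suppressWarnings(cor(toy))
cm[upper.tri(cm)] <- 0
diag(cm) <- 0

# Without na.rm, any() returns NA for column `b`; using this vector as a
# logical column index triggers "undefined columns selected".
flag_raw <- apply(cm, 2, function(x) any(x > 0.90))

# With abs() and na.rm = TRUE the index is strictly TRUE/FALSE,
# and negative correlations are caught as well.
flag <- apply(cm, 2, function(x) any(abs(x) > 0.90, na.rm = TRUE))
toy_new <- toy[, !flag, drop = FALSE]
colnames(toy_new)
```

In this variant the constant column is kept, and for each highly correlated pair it is the earlier column (`a` here) that gets dropped.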
You could do it with a loop. Here's an example using `mtcars`. You set the threshold in `r_threshold` (.8 in the example below). You loop over the columns of `mtcars`, each time removing the later columns whose absolute correlation with the current column is above the pre-defined threshold. After the relevant columns have been removed, the loop moves on to the next column, considering only the ones that were not removed in previous steps. Notice that `cyl`, `disp` and `wt` have been removed (you can see this by the difference in the column names before and after the loop).
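```r
data(mtcars)
colnames(mtcars)
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"
r_threshold <- .8
keep_going <- TRUE
i <- 1
while (keep_going) {
  # columns to the right of column i
  s <- seq(i + 1, ncol(mtcars))
  r <- cor(mtcars[, s], mtcars[, i])
  if (any(abs(r) > r_threshold)) {
    # drop later columns too strongly correlated with column i;
    # drop = FALSE keeps a data frame even if one column remains
    mtcars <- mtcars[, -s[which(abs(r) > r_threshold)], drop = FALSE]
  }
  i <- i + 1
  # stop once column i has no columns to its right
  if (ncol(mtcars) <= i) {
    keep_going <- FALSE
  }
}
colnames(mtcars)
#> [1] "mpg"  "hp"   "drat" "qsec" "vs"   "am"   "gear" "carb"
```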
Created on 2023-02-09 by the reprex package (v2.0.1)
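To apply the same idea to another data frame such as `rf1`, the loop can be wrapped in a function. This is only a sketch: `drop_correlated` is a hypothetical helper name, not part of the reprex above. It assumes that non-numeric columns can simply be dropped and that an `NA` correlation (for example from a constant column) should count as "not correlated".

```r
# Hypothetical helper (not from the original answer): removes every later
# column whose absolute correlation with an earlier kept column exceeds
# `threshold`.
drop_correlated <- function(df, threshold = 0.9) {
  # Assumption: only numeric columns are of interest; cor() needs numeric input.
  df <- df[, vapply(df, is.numeric, logical(1)), drop = FALSE]
  i <- 1
  while (i < ncol(df)) {
    s <- seq(i + 1, ncol(df))
    # Correlation of column i with each later column (a length(s) x 1 matrix).
    r <- suppressWarnings(
      cor(df[, s, drop = FALSE], df[[i]], use = "pairwise.complete.obs")
    )
    # Treat NA correlations as "not above the threshold".
    too_high <- !is.na(r[, 1]) & abs(r[, 1]) > threshold
    if (any(too_high)) {
      df <- df[, -s[too_high], drop = FALSE]
    }
    i <- i + 1
  }
  df
}

# Same data and threshold as the loop above.
reduced <- drop_correlated(mtcars, threshold = 0.8)
colnames(reduced)
#> [1] "mpg"  "hp"   "drat" "qsec" "vs"   "am"   "gear" "carb"
```

For the dataset in the question, the call would presumably look like `rf1_reduced <- drop_correlated(rf1, threshold = 0.9)`. Because the earlier column of each correlated pair is the one that survives, the column order of the input decides which features are kept.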