Tags: r, function, subset, correlation

Remove highly correlated variables and keep the weakly correlated ones


I have a dataset "rf1" with 845 features and 1052 rows. Before doing ML, I want to eliminate the highly correlated features. I wrote this code, but it only shows me the features and their correlations without eliminating them:

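`library(dplyr)  # provides %>% and mutate_if
corr_simple <- function(rf1, sig = 0.9) {
  # Convert character columns to factors, then factors to numeric codes
  df_cor <- rf1 %>% mutate_if(is.character, as.factor)
  df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)
  corr <- cor(df_cor)
  # Blank out the lower triangle and diagonal so each pair appears once
  corr[lower.tri(corr, diag = TRUE)] <- NA
  corr[corr == 1] <- NA
  # Reshape to a long table of pairs and keep those above the threshold
  corr <- as.data.frame(as.table(corr))
  corr <- na.omit(corr)
  corr <- subset(corr, abs(Freq) > sig)
  corr <- corr[order(-abs(corr$Freq)), ]
  print(corr)
  mtx_corr <- reshape2::acast(corr, Var1 ~ Var2, value.var = "Freq")
}
corr_simple(rf1)`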

This prints the highly correlated pairs, but I want to eliminate the variables whose correlation exceeds a threshold of 0.9.

When I use functions found here, like the one below, I get an error message:

`data<-data.frame(rf1)
cor_matrix <- cor(data)
cor_matrix_rm <- cor_matrix                 
cor_matrix_rm[upper.tri(cor_matrix_rm)] <- 0
diag(cor_matrix_rm) <- 0
cor_matrix_rm
data_new <- data[ , !apply(cor_matrix_rm, 2, function(x) any(x > 0.90))]
Error in [.data.frame(data, , !apply(cor_matrix_rm, 2, function(x) any(x >  : 
  undefined columns selected`

I searched for and tried other solutions, but I always run into this same problem.


Solution

  • You could do it with a loop. Here's an example using mtcars, with the threshold stored in r_threshold (0.8 in the example below). The loop walks over the columns of mtcars; at each step it removes every later column whose absolute correlation with the current column is above the threshold, then moves on to the next remaining column. Because earlier columns are always kept, the result depends on column order. Comparing the column names before and after the loop shows that cyl, disp and wt have been removed.

    data(mtcars)
    colnames(mtcars)
    #>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
    #> [11] "carb"
    
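    r_threshold <- .8
    keep_going <- TRUE
    i <- 1
    while(keep_going){
      # columns to the right of column i that are still in the data
      s <- seq(i+1, ncol(mtcars))
      r <- cor(mtcars[, s, drop = FALSE], mtcars[,i])
      if(any(abs(r) > r_threshold)){
        # drop = FALSE keeps a data frame even if one column remains
        mtcars <- mtcars[, -s[which(abs(r) > r_threshold)], drop = FALSE]
      }
      i <- i+1
      # stop once there is no column left to the right of column i
      if(ncol(mtcars) <= i){
        keep_going <- FALSE
      }
    }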
    colnames(mtcars)
    #> [1] "mpg"  "hp"   "drat" "qsec" "vs"   "am"   "gear" "carb"
    

    Created on 2023-02-09 by the reprex package (v2.0.1)
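
The loop above can also be wrapped in a function so that it works on any data frame and leaves the original data untouched. The following is only a sketch under some assumptions: the name `drop_correlated` and its `threshold` argument are hypothetical, non-numeric columns are simply dropped (the question converts them to numeric codes instead), and correlations that come out as `NA` are treated as "not above the threshold". That last choice is a guess at what causes the "undefined columns selected" error in the question: if `cor()` returns `NA` for some pairs (for example because a column is constant or contains missing values), the logical column index contains `NA`, which data-frame indexing rejects. Without seeing `rf1` this cannot be confirmed.

```r
# Hypothetical helper, not part of the original answer: greedily drop every
# column whose absolute correlation with an earlier, kept column exceeds
# `threshold`.
drop_correlated <- function(df, threshold = 0.9) {
  # Assumption: only numeric columns are considered; others are dropped.
  num <- df[, vapply(df, is.numeric, logical(1)), drop = FALSE]
  i <- 1
  while (i < ncol(num)) {
    # Columns to the right of column i that are still present
    s <- seq(i + 1, ncol(num))
    r <- cor(num[, s, drop = FALSE], num[, i],
             use = "pairwise.complete.obs")
    # Assumption: an NA correlation does not count as "too high"
    too_high <- !is.na(r) & abs(r) > threshold
    if (any(too_high)) {
      num <- num[, -s[too_high], drop = FALSE]
    }
    i <- i + 1
  }
  num
}

# Same data and threshold as the loop above
reduced <- drop_correlated(datasets::mtcars, threshold = 0.8)
colnames(reduced)
#> [1] "mpg"  "hp"   "drat" "qsec" "vs"   "am"   "gear" "carb"
```

For the data in the question, the call would presumably be `drop_correlated(rf1, threshold = 0.9)`. As with the loop, which column of a correlated pair survives depends on column order, so it may be worth putting the features you most want to keep first.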