data-science, data-analysis

Data Selection - Finding relations between dataframe attributes


Let's say I have a dataframe with 80 columns and 1 target column, for example a bank account table with 80 attributes for each record (account) and 1 target column that indicates whether the client stays or leaves. What steps and algorithms should I follow to select the columns with the highest impact on the target column?


Solution

  • This is one way to do it, using the Pearson correlation coefficient in RStudio. I used it once when exploring the red_wine dataset: my target variable (column) was quality, and I wanted to know the effect of the rest of the columns on it. In the plot produced by the code below, blue represents a positive relationship and red a negative one, and the closer the value is to 1 or -1, the darker the color.

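    library(dplyr)     # select(), mutate(), %>%
    library(corrplot)  # corrplot()

    # Pearson correlation matrix (the default method of cor())
    cor_matrix <- cor(
      red_wine %>%
        # first we remove unwanted columns
        dplyr::select(-X, -rating) %>%
        mutate(
          # now we translate quality to a number
          quality = as.numeric(quality)
        )
    )

    # lower-triangle heat map with the coefficients printed in each cell
    corrplot(cor_matrix, method = "color", type = "lower",
             addCoef.col = "gray", title = "Red Wine Variables Correlations",
             mar = c(0, 0, 1, 0), tl.cex = 0.7, tl.col = "black",
             number.cex = 0.9)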
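
For the 80-column case in the question, a full correlation plot gets hard to read, so it may be easier to look only at each column's correlation with the target and sort by absolute value. The sketch below uses base R only and a made-up `accounts` data frame; the column names, the `churn` target and the way it is generated are all assumptions for illustration, not part of the original answer. Keep in mind that Pearson correlation only captures linear relationships, so treat the ranking as a first screening step rather than a final selection.

```r
# Hypothetical data: 200 accounts, 4 numeric attributes, 1 binary target.
set.seed(42)
n <- 200
accounts <- data.frame(
  balance    = rnorm(n),
  tenure     = rnorm(n),
  n_products = rnorm(n),
  noise      = rnorm(n)
)

# Assumption for the demo: the target depends strongly on balance,
# weakly on tenure, and not at all on the other two columns.
latent <- 2 * accounts$balance - 0.5 * accounts$tenure + rnorm(n)
accounts$churn <- as.numeric(latent > 0)

# Pearson correlation of every attribute with the target column.
predictors <- setdiff(names(accounts), "churn")
target_cor <- sapply(accounts[predictors],
                     function(col) cor(col, accounts$churn))

# Rank attributes by the absolute strength of the correlation.
ranking <- sort(abs(target_cor), decreasing = TRUE)
print(round(ranking, 2))
```

On real data you would replace `accounts` with your own table and `"churn"` with the name of your target column; non-numeric columns would need to be encoded as numbers or dropped before calling `cor()`.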