r, feature-selection

How do you use a Pearson correlation to select features in `R`?


Pearson correlation can help in feature selection. For example, here we read:

[Image: formula for the Pearson correlation coefficient between a feature Xi and the target Y]

where Y is the target and Xi the feature. I would like to estimate this metric for each (feature, target) pair. However, I also have a categorical feature (x4): how should I handle it?

> dput(df)
structure(list(x1 = c(1090, 1020, 883, 209, 1, 1, 0, 3, 3, 2, 
2, 17, 11, 15, 1, 21, 12, 15, 6, 5, 9, 10, 15, 20, 23, 34, 22, 
29, 31, 16), y = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), x2 = c(1, 1, 
1, 76, 1e+07, 1e+07, 1e+07, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
2, 2, 7, 1, 2, 2, 1, 1, 1, 3, 1, 1), x3 = c(2052.29646840149, 
1086.36659663866, 5229.23948598131, 280.329896907216, 9844331, 
9844331, 9844331, 14776.3333333333, 14776.3333333333, 2239.33333333333, 
2239.33333333333, 52526.25, 104597.666666667, 7341.42857142857, 
9844331, 3394.73684210526, 6565.28, 6565.28, 10738.5, 10738.5, 
6289, 10253.4, 6948.41379310345, 6948.41379310345, 1946.73076923077, 
1946.73076923077, 18460.15, 8886.61538461538, 43386.1153846154, 
7513.66666666667), x4 = c("Fr", "Tu", "We", "Su", "Mo", "Mo", 
"We", "Su", "Su", "Su", "Su", "Sa", "Fr", "We", "Mo", "Mo", "Su", 
"Su", "Sa", "Sa", "Th", "Fr", "Mo", "Mo", "Mo", "Mo", "Sa", "Sa", 
"Th", "Sa"), x5 = c(1, 18, 22, 16, 3, 3, 15, 19, 19, 21, 21, 
15, 16, 5, 7, 7, 11, 11, 9, 9, 19, 16, 0, 0, 0, 0, 13, 3, 17, 
7), x6 = c(147, 139, 139, 139, 134, 126, 95, 95, 95, 147, 139, 
138, 138, 138, 138, 138, 138, 138, 138, 138, 138, 138, 147, 139, 
139, 139, 139, 139, 139, 139), x7 = c(2, 2, 2, 2, 1, 1, 2, 2, 
2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3), x8 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), x9 = c(0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0)), row.names = c(NA, -30L), class = c("tbl_df", "tbl", 
"data.frame"))

Solution

  • We could use the cor() function. To correlate only the numeric columns, select them with sapply:

    # Keep only the numeric columns, then compute pairwise Pearson correlations
    num_df <- df[sapply(df, is.numeric)]
    round(cor(num_df), digits = 2)
    

    Output:

          x1  y    x2    x3    x5    x6    x7 x8 x9
    x1  1.00 NA -0.13 -0.16  0.11  0.20 -0.36 NA NA
    y     NA  1    NA    NA    NA    NA    NA NA NA
    x2 -0.13 NA  1.00  0.85 -0.17 -0.39 -0.72 NA NA
    x3 -0.16 NA  0.85  1.00 -0.20 -0.32 -0.57 NA NA
    x5  0.11 NA -0.17 -0.20  1.00 -0.29 -0.03 NA NA
    x6  0.20 NA -0.39 -0.32 -0.29  1.00  0.44 NA NA
    x7 -0.36 NA -0.72 -0.57 -0.03  0.44  1.00 NA NA
    x8    NA NA    NA    NA    NA    NA    NA  1 NA
    x9    NA NA    NA    NA    NA    NA    NA NA  1
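
The NA entries appear because y, x8 and x9 are constant in the posted sample: their standard deviation is zero, so the Pearson coefficient is undefined. The matrix above also leaves out the categorical x4. One possible way to cover both points, sketched here as an assumption rather than as part of the original answer, is to one-hot encode x4 with model.matrix(), drop the zero-variance columns, and correlate each remaining column with the target. The data frame `toy` and its values are hypothetical; it only mimics the shape of `df`, with a target that actually varies, since a constant y cannot be correlated with anything.

```r
# Toy data (hypothetical values); it only mimics the shape of the question's df
toy <- data.frame(
  y  = c(1, 0, 1, 1, 0, 0, 1, 0),
  x1 = c(5, 2, 6, 7, 1, 3, 8, 2),
  x4 = c("Mo", "Su", "Mo", "Sa", "Su", "Su", "Mo", "Sa"),
  x8 = rep(1, 8)  # constant column, like x8 and x9 in the question
)

# One-hot encode the categorical column: one 0/1 indicator per level
dummies <- model.matrix(~ x4 - 1, data = toy)

# Numeric features (target excluded) plus the indicator columns
num_cols <- setdiff(names(toy)[sapply(toy, is.numeric)], "y")
features <- cbind(as.matrix(toy[num_cols]), dummies)

# Drop zero-variance columns, whose correlation would be NA
features <- features[, apply(features, 2, sd) > 0, drop = FALSE]

# Pearson correlation of each remaining column with the target
target_cor <- cor(features, toy$y)[, 1]

# Rank the columns by absolute correlation with the target
ranked <- sort(abs(target_cor), decreasing = TRUE)
print(round(ranked, 2))
```

Each indicator column gets its own coefficient, so a weekday such as "Mo" is scored separately from "Su". If a single score for x4 as a whole is preferred, a different measure (for example one based on a one-way ANOVA of y on x4) may be a better fit than Pearson correlation.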