Pearson correlation can help in feature selection. For example, here we read:
where Y is the target and Xi the feature. I would like to estimate the metric for each (feature, target) pair. But I also have a categorical feature (x4): how could I proceed?
> dput(df)
structure(list(x1 = c(1090, 1020, 883, 209, 1, 1, 0, 3, 3, 2,
2, 17, 11, 15, 1, 21, 12, 15, 6, 5, 9, 10, 15, 20, 23, 34, 22,
29, 31, 16), y = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), x2 = c(1, 1,
1, 76, 1e+07, 1e+07, 1e+07, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2, 2, 7, 1, 2, 2, 1, 1, 1, 3, 1, 1), x3 = c(2052.29646840149,
1086.36659663866, 5229.23948598131, 280.329896907216, 9844331,
9844331, 9844331, 14776.3333333333, 14776.3333333333, 2239.33333333333,
2239.33333333333, 52526.25, 104597.666666667, 7341.42857142857,
9844331, 3394.73684210526, 6565.28, 6565.28, 10738.5, 10738.5,
6289, 10253.4, 6948.41379310345, 6948.41379310345, 1946.73076923077,
1946.73076923077, 18460.15, 8886.61538461538, 43386.1153846154,
7513.66666666667), x4 = c("Fr", "Tu", "We", "Su", "Mo", "Mo",
"We", "Su", "Su", "Su", "Su", "Sa", "Fr", "We", "Mo", "Mo", "Su",
"Su", "Sa", "Sa", "Th", "Fr", "Mo", "Mo", "Mo", "Mo", "Sa", "Sa",
"Th", "Sa"), x5 = c(1, 18, 22, 16, 3, 3, 15, 19, 19, 21, 21,
15, 16, 5, 7, 7, 11, 11, 9, 9, 19, 16, 0, 0, 0, 0, 13, 3, 17,
7), x6 = c(147, 139, 139, 139, 134, 126, 95, 95, 95, 147, 139,
138, 138, 138, 138, 138, 138, 138, 138, 138, 138, 138, 147, 139,
139, 139, 139, 139, 139, 139), x7 = c(2, 2, 2, 2, 1, 1, 2, 2,
2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3), x8 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), x9 = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0)), row.names = c(NA, -30L), class = c("tbl_df", "tbl",
"data.frame"))
We could use the cor() function.
To correlate only the numeric columns, we could select them with sapply:
round(cor(df[sapply(df, is.numeric)]), digits = 2)
Output:
      x1  y    x2    x3    x5    x6    x7 x8 x9
x1  1.00 NA -0.13 -0.16  0.11  0.20 -0.36 NA NA
y     NA  1    NA    NA    NA    NA    NA NA NA
x2 -0.13 NA  1.00  0.85 -0.17 -0.39 -0.72 NA NA
x3 -0.16 NA  0.85  1.00 -0.20 -0.32 -0.57 NA NA
x5  0.11 NA -0.17 -0.20  1.00 -0.29 -0.03 NA NA
x6  0.20 NA -0.39 -0.32 -0.29  1.00  0.44 NA NA
x7 -0.36 NA -0.72 -0.57 -0.03  0.44  1.00 NA NA
x8    NA NA    NA    NA    NA    NA    NA  1 NA
x9    NA NA    NA    NA    NA    NA    NA NA  1
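Two things are worth noting about this output. The NA entries come from constant columns: in the posted df, y, x8 and x9 never change, so their standard deviation is zero and the Pearson coefficient is undefined. Also, the is.numeric filter simply skips x4. One possible way to bring a categorical column in (an assumption on my part, not the only option) is to expand it into 0/1 indicator columns with model.matrix() and correlate each indicator with the target. The sketch below uses a small made-up data frame, toy, rather than the df from the question, because there y is constant and every feature/target correlation would be NA.

```r
# Toy data, made up for illustration (not the df from the question).
toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x4 = c("Mo", "Sa", "Sa", "Mo", "Sa", "Su"),
  x8 = c(1, 1, 1, 1, 1, 1),      # constant, like x8 and x9 in the question
  y  = c(0, 0, 1, 0, 1, 1)
)

# One 0/1 indicator column per level of x4 ("- 1" drops the intercept,
# so every level gets its own column: x4Mo, x4Sa, x4Su).
dummies <- model.matrix(~ x4 - 1, data = toy)

# Numeric features (without the target) plus the indicator columns.
is_num   <- sapply(toy, is.numeric) & names(toy) != "y"
features <- cbind(toy[is_num], dummies)

# Drop constant columns: their standard deviation is zero, so cor() gives NA.
features <- features[sapply(features, sd) > 0]

# Pearson correlation of every remaining feature with the target.
feat_cor <- sapply(features, function(col) cor(col, toy$y))
round(feat_cor, 2)
```

The result is a named vector with one coefficient per numeric feature and one per level of x4. With many levels this produces many indicator columns, so a single per-feature measure (for example one based on an ANOVA of y against x4) may be easier to rank.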