I have a dataframe that represents the results of Pearson's correlation. This is a sample of the structure:
Row      | Column   | cor
Event    | Event    | 1
mean     | Event    | .82
mean     | kurtosis | .30
mean     | entropy  | .85
entropy  | Event    | .71
entropy  | kurtosis | .25
kurtosis | Event    | .69
I need to filter the correlations so that, if the correlation between two features is above a threshold of 0.80 (condition 1), only the feature with the highest correlation with "Event" is kept (condition 2). I am hoping the end product will look like this:
Row  | Column | cor
mean | Event  | .82
In the example above, mean and entropy are correlated above the threshold, but "mean" has the higher correlation with "Event", so it is the final output. I am using biological data, so I have hundreds of features and it's too much to do manually.
We can filter with dplyr:
library(dplyr)
df1 %>%
filter(Row != Column, cor > 0.80, Column == 'Event')
Output:
# Row Column cor
#1 mean Event 0.82
Or use data.table
library(data.table)
setDT(df1)[Row != Column & cor > 0.8 & Column == 'Event']
# Sample data
df1 <- structure(list(Row = c("Event", "mean", "mean", "mean", "entropy",
"entropy", "kurtosis"), Column = c("Event", "Event", "kurtosis",
"entropy", "Event", "kurtosis", "Event"), cor = c(1, 0.82, 0.3,
0.85, 0.71, 0.25, 0.69)), class = "data.frame", row.names = c(NA,
-7L))
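If the pairwise condition from the question has to be applied explicitly (find feature pairs correlated above 0.80, then keep whichever member of the pair correlates more strongly with "Event"), a minimal base R sketch could look like the following. It rests on several assumptions that are not in the original post: every feature has exactly one row against "Event", each feature pair appears only once, the result should contain only the winners of correlated pairs (as in the expected output), and the names `threshold`, `target`, `target_cor`, `pairs`, `winner` and `result` are purely illustrative. The data is rebuilt here with the "kurtosis" spelling so the block runs on its own.

```r
# Sample data with the same structure as df1 (spelling of 'kurtosis' normalised)
df1 <- data.frame(
  Row    = c("Event", "mean", "mean", "mean", "entropy", "entropy", "kurtosis"),
  Column = c("Event", "Event", "kurtosis", "entropy", "Event", "kurtosis", "Event"),
  cor    = c(1, 0.82, 0.3, 0.85, 0.71, 0.25, 0.69),
  stringsAsFactors = FALSE
)

threshold <- 0.80     # condition 1 cut-off (illustrative name)
target    <- "Event"  # outcome variable (illustrative name)

# Correlation of each feature with the target, as a named lookup vector
with_target <- df1[df1$Column == target & df1$Row != target, ]
target_cor  <- setNames(with_target$cor, with_target$Row)

# Feature-feature pairs whose correlation exceeds the threshold (condition 1)
pairs <- df1[df1$Row != target & df1$Column != target &
             df1$Row != df1$Column & df1$cor > threshold, ]

# Within each pair, pick the feature with the higher target correlation (condition 2)
winner <- ifelse(target_cor[pairs$Row] >= target_cor[pairs$Column],
                 pairs$Row, pairs$Column)

# Keep the target-correlation rows of the winning features only
result <- with_target[with_target$Row %in% unique(winner), ]
print(result)
```

On the sample data the only pair above the threshold is mean/entropy, so `result` holds just the mean/Event row. This sketch does not handle a feature that wins one pair and loses another; with hundreds of features that case would need an explicit rule.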