Tags: r, filter, conditional-statements, sequential

Is there a way in R to filter a data frame by two sequential conditions?


I have a dataframe that represents the results of Pearson's correlation. This is a sample of the structure:

Row     | Column | cor
Event   |Event   | 1
mean    |Event   | .82
mean    |kurtosis| .30
mean    |entropy | .85
entropy |Event   | .71
entropy |kurtosis| .25
kurtosis|Event   | .69

I need to filter the correlations so that, if the correlation between two features exceeds a threshold of 0.80 (condition 1), only the feature with the highest correlation with "Event" is kept (condition 2). I am hoping the end product will look like this:

Row     | Column | cor
mean    |Event   | .82

In the example above, mean and entropy are correlated above the threshold; however, "mean" has the higher correlation with "Event", so that is the final output. I am working with biological data, so I have hundreds of features and it's too much to do manually.


Solution

  • We can keep the off-diagonal rows that pair a feature with 'Event' and exceed the 0.80 threshold:

    library(dplyr)
    df1 %>%
         filter(Row  != Column, cor > 0.80, Column == 'Event')
    

    -output

    #   Row Column  cor
    #1 mean  Event 0.82
    

    Or use data.table

    library(data.table)
    setDT(df1)[Row != Column & cor > 0.8 & Column == 'Event']
    

    data

    df1 <- structure(list(Row = c("Event", "mean", "mean", "mean", "entropy", 
    "entropy", "kurtosis"), Column = c("Event", "Event", "kurtosis", 
    "entropy", "Event", "kurtosis", "Event"), cor = c(1, 0.82, 0.3, 
    0.85, 0.71, 0.25, 0.69)), class = "data.frame", row.names = c(NA, 
    -7L))
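
The filter above returns the expected row for the sample, but it does not itself check which feature pairs are correlated above the threshold. For readers who want both conditions applied in sequence, here is a minimal sketch, not a definitive implementation. It assumes every feature's correlation with "Event" is stored as a row with Column == "Event", as in the sample. The names threshold, event_cor, high_pairs, winners and result are illustrative and not from the original post. The sample data is rebuilt here with "kurtosis" spelled consistently.

```r
library(dplyr)

# Sample data from the question
df1 <- data.frame(
  Row    = c("Event", "mean", "mean", "mean", "entropy", "entropy", "kurtosis"),
  Column = c("Event", "Event", "kurtosis", "entropy", "Event", "kurtosis", "Event"),
  cor    = c(1, 0.82, 0.3, 0.85, 0.71, 0.25, 0.69),
  stringsAsFactors = FALSE
)

threshold <- 0.80  # condition 1 cutoff, taken from the question

# Each feature's correlation with "Event" (the Event/Event diagonal is dropped)
event_cor <- df1 %>%
  filter(Column == "Event", Row != "Event")

# Condition 1: feature-feature pairs correlated above the threshold
high_pairs <- df1 %>%
  filter(Row != "Event", Column != "Event", Row != Column, cor > threshold)

# Condition 2: within each such pair, keep the member with the higher
# correlation with "Event" (assumes both members have an Event row;
# otherwise `keep` is NA for that pair)
winners <- high_pairs %>%
  left_join(select(event_cor, Row, row_event = cor), by = "Row") %>%
  left_join(select(event_cor, Column = Row, col_event = cor), by = "Column") %>%
  mutate(keep = if_else(row_event >= col_event, Row, Column))

# Report the winners together with their correlation with "Event"
result <- event_cor %>%
  filter(Row %in% winners$keep)

# result holds one row: mean, Event, 0.82
```

On the sample, the only pair above the threshold is mean/entropy (0.85), and mean wins because 0.82 > 0.71. Note that this sketch resolves each pair independently. With hundreds of features, a feature can win one pair and lose another, so an additional rule (for example, dropping every feature that loses any pair) may be needed depending on the intended selection.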