Drop rows of data if two conditions don't exist in a column in R


I have a data frame with about 52K rows. I want to keep only the genes that have both Light and Healthy in the group column, and drop every row belonging to a gene that lacks either value. I am not sure how to do this quickly; I was thinking that the tidyverse or dplyr might be useful.

data
         gene      id   group           snp ref total ref_condition
11080    ZZZ3 Healthy Healthy chr1:77664558   1     5       Healthy
22772    ZZZ3 Healthy Healthy chr1:77557488   2     5       Healthy
1632    ZZEF1 Healthy Healthy chr17:4086375   4     7       Healthy
13357   ZZEF1 Healthy Healthy chr17:4033235   7     9       Healthy
15312  ZYG11B Healthy Healthy chr1:52769202   1     2       Healthy
145341 ZYG11B   Light   Light chr1:52779185   1     4       Healthy

Wanted output
             gene      id   group           snp ref total ref_condition
    15312  ZYG11B Healthy Healthy chr1:52769202   1     2       Healthy
    145341 ZYG11B   Light   Light chr1:52779185   1     4       Healthy

Solution

  • You could group by gene and combine two any() calls inside filter(), so that a gene is kept only when its group column contains both values:

    library(dplyr)
    data %>%
      group_by(gene) %>%
      filter(any(group == "Healthy") & any(group == "Light"))
    #> # A tibble: 2 × 7
    #> # Groups:   gene [1]
    #>   gene   id      group   snp             ref total ref_condition
    #>   <chr>  <chr>   <chr>   <chr>         <int> <int> <chr>        
    #> 1 ZYG11B Healthy Healthy chr1:52769202     1     2 Healthy      
    #> 2 ZYG11B Light   Light   chr1:52779185     1     4 Healthy
    

    Created on 2023-01-23 with reprex v2.0.2
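
A variant of the same idea, offered as a minimal sketch rather than as part of the original answer: the two any() calls can be collapsed into a single all(... %in% group) test, which may be easier to extend if more required values are added later. The sample rows are copied from the question (row names omitted), and the names `required` and `kept` are hypothetical, introduced only for illustration.

```r
library(dplyr)

# Sample rows copied from the question (row names omitted)
data <- data.frame(
  gene  = c("ZZZ3", "ZZZ3", "ZZEF1", "ZZEF1", "ZYG11B", "ZYG11B"),
  id    = c("Healthy", "Healthy", "Healthy", "Healthy", "Healthy", "Light"),
  group = c("Healthy", "Healthy", "Healthy", "Healthy", "Healthy", "Light"),
  snp   = c("chr1:77664558", "chr1:77557488", "chr17:4086375",
            "chr17:4033235", "chr1:52769202", "chr1:52779185"),
  ref   = c(1L, 2L, 4L, 7L, 1L, 1L),
  total = c(5L, 5L, 7L, 9L, 2L, 4L),
  ref_condition = "Healthy"
)

# Values that every retained gene must have in its group column
# (hypothetical helper name, not from the original answer)
required <- c("Healthy", "Light")

kept <- data %>%
  group_by(gene) %>%
  # One TRUE/FALSE per gene: TRUE only if all required values occur
  filter(all(required %in% group)) %>%
  # Drop the grouping so later steps operate on the whole table
  ungroup()

kept
```

The ungroup() call is optional here; it only prevents the grouping by gene from carrying over into later operations on the result.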