I have a 52K-row data frame. I want to keep only the genes that have both Light and Healthy in the group column and filter out all the others. I am not sure how to do this quickly; I was thinking that tidyverse or dplyr might be useful.
data
gene id group snp ref total ref_condition
11080 ZZZ3 Healthy Healthy chr1:77664558 1 5 Healthy
22772 ZZZ3 Healthy Healthy chr1:77557488 2 5 Healthy
1632 ZZEF1 Healthy Healthy chr17:4086375 4 7 Healthy
13357 ZZEF1 Healthy Healthy chr17:4033235 7 9 Healthy
15312 ZYG11B Healthy Healthy chr1:52769202 1 2 Healthy
145341 ZYG11B Light Light chr1:52779185 1 4 Healthy
Wanted output
gene id group snp ref total ref_condition
15312 ZYG11B Healthy Healthy chr1:52769202 1 2 Healthy
145341 ZYG11B Light Light chr1:52779185 1 4 Healthy
You could use two any() calls per group_by(), like this:
library(dplyr)
data %>%
group_by(gene) %>%
filter(any(group == "Healthy") & any(group == "Light"))
#> # A tibble: 2 × 7
#> # Groups: gene [1]
#> gene id group snp ref total ref_condition
#> <chr> <chr> <chr> <chr> <int> <int> <chr>
#> 1 ZYG11B Healthy Healthy chr1:52769202 1 2 Healthy
#> 2 ZYG11B Light Light chr1:52779185 1 4 Healthy
Created on 2023-01-23 with reprex v2.0.2
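For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the six rows from the question as a plain data.frame (row names omitted) and uses all(required %in% group) as an equivalent way of writing the two any() checks. The names required and result are only illustrative and are not from the original post; ungroup() is added on the assumption that you do not need the grouping afterwards.

```r
library(dplyr)

# Toy data rebuilt from the rows shown in the question (row names omitted)
data <- data.frame(
  gene          = c("ZZZ3", "ZZZ3", "ZZEF1", "ZZEF1", "ZYG11B", "ZYG11B"),
  id            = c("Healthy", "Healthy", "Healthy", "Healthy", "Healthy", "Light"),
  group         = c("Healthy", "Healthy", "Healthy", "Healthy", "Healthy", "Light"),
  snp           = c("chr1:77664558", "chr1:77557488", "chr17:4086375",
                    "chr17:4033235", "chr1:52769202", "chr1:52779185"),
  ref           = c(1L, 2L, 4L, 7L, 1L, 1L),
  total         = c(5L, 5L, 7L, 9L, 2L, 4L),
  ref_condition = rep("Healthy", 6)
)

# Group labels every gene must contain to be kept (illustrative name)
required <- c("Healthy", "Light")

result <- data %>%
  group_by(gene) %>%
  # TRUE for a whole gene only if every required label occurs in its group column
  filter(all(required %in% group)) %>%
  # drop the grouping so later steps operate on the full table again
  ungroup()

print(result)
```

Writing the condition with %in% makes it easy to require more labels later by extending required, without adding another any() per label.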