r filter tidyverse subset bioinformatics

Filter rows in an R tidy data frame based on same-column values that are located in different rows

I am trying to filter out the species (ASV) in samples that have lower abundance counts than the same species in controls. Below is a simplified version of my tidy data table:

ASV	ID	Type	Batch	Counts
ASV1	Control23	Con1	B1	5000
ASV2	Control23	Con1	B1	2000
ASV1	Sample21	S	B1	4000
ASV2	Sample21	S	B1	6000
ASV1	Sample28	S	B1	10000
ASV2	Control25	Con25	B2	25000
ASV3	Control25	Con25	B2	9000
ASV2	Sample12	S	B2	70000
ASV3	Sample12	S	B2	26000
ASV2	Sample10	S	B2	3100

For example, I would want to remove ASV1 in Sample21 from the dataset because its counts are less than the Control23 that was used for Batch B1. Another example: I want to keep ASV2 in Sample12 because its counts are higher than the counts of the same species in Control25 in Batch B2.

I've been using tidyverse to clean and group my data but I'm stuck as to how to subset the ASV count value for the control in each batch. I thought it would be simple to use filter as follows:

df_ASV.Tidy.Batch <- df_ASV.Tidy %>% 
   group_by(Batch, Type, ID, ASV) %>% 
   filter((Type != "S" & Counts > 0) > (Type != "C" & Counts > 0))

But the output only acknowledged the argument for Type != "S" & Counts > 0 and not Type != "C" & Counts > 0, and the output was the same even without Type != "C" & Counts > 0.

I think I would likely need to create a new column by mutate or summarise to return TRUE or FALSE and then use filter, but it returns back to the original problem where I can't figure out how to specifically compare the count values of control vs sample of the same species and batch.

Any advice would be greatly appreciated. Thank you!

Solution

You can indentify the ones to remove using a non-equi join, and then anti-join on those:

anti_join(
  df_ASV.Tidy, 
  inner_join(
    df_ASV.Tidy %>% filter(substr(ID,1,1)=="S"),
    df_ASV.Tidy %>% filter(substr(ID,1,1)=="C"),
    join_by(ASV, Batch, x$Counts<y$Counts)
  ) %>% distinct(ASV, Batch, ID =ID.x)
)

Output:

   ASV        ID  Type Batch Counts
1 ASV1 Control23  Con1    B1   5000
2 ASV2 Control23  Con1    B1   2000
3 ASV2  Sample21     S    B1   6000
4 ASV1  Sample28     S    B1  10000
5 ASV2 Control25 Con25    B2  25000
6 ASV3 Control25 Con25    B2   9000
7 ASV2  Sample12     S    B2  70000
8 ASV3  Sample12     S    B2  26000

This removes the following:

ASV1, Sample21 (because less than Control 23 in the same batch/species
ASV2, Sample10 (because less than Control 25 in the same batch/species

An alternative approach is to identify the samples you want to retain using the non-equi join, and then row bind with the controls

bind_rows(
  inner_join(
    df_ASV.Tidy %>% filter(substr(ID,1,1)=="S"),
    df_ASV.Tidy %>% filter(substr(ID,1,1)=="C"),
    join_by(ASV, Batch, x$Counts>y$Counts)
  ) %>% select(ASV, ID=ID.x, Type=Type.x, Batch, Counts = Counts.x),
  df_ASV.Tidy %>% filter(substr(ID,1,1) == "C")
)

Result is equivalent.