Search code examples
rfiltertidyversesubsetbioinformatics

Filter rows in an R tidy data frame based on same-column values that are located in different rows


I am trying to filter out the species (ASV) in samples that have lower abundance counts than the same species in controls. Below is a simplified version of my tidy data table:

ASV ID Type Batch Counts
ASV1 Control23 Con1 B1 5000
ASV2 Control23 Con1 B1 2000
ASV1 Sample21 S B1 4000
ASV2 Sample21 S B1 6000
ASV1 Sample28 S B1 10000
ASV2 Control25 Con25 B2 25000
ASV3 Control25 Con25 B2 9000
ASV2 Sample12 S B2 70000
ASV3 Sample12 S B2 26000
ASV2 Sample10 S B2 3100

For example, I would want to remove ASV1 in Sample21 from the dataset because its counts are less than the Control23 that was used for Batch B1. Another example: I want to keep ASV2 in Sample12 because its counts are higher than the counts of the same species in Control25 in Batch B2.

I've been using tidyverse to clean and group my data but I'm stuck as to how to subset the ASV count value for the control in each batch. I thought it would be simple to use filter as follows:

df_ASV.Tidy.Batch <- df_ASV.Tidy %>% 
   group_by(Batch, Type, ID, ASV) %>% 
   filter((Type != "S" & Counts > 0) > (Type != "C" & Counts > 0)) 

But the output only acknowledged the argument for Type != "S" & Counts > 0 and not Type != "C" & Counts > 0, and the output was the same even without Type != "C" & Counts > 0.

I think I would likely need to create a new column by mutate or summarise to return TRUE or FALSE and then use filter, but it returns back to the original problem where I can't figure out how to specifically compare the count values of control vs sample of the same species and batch.

Any advice would be greatly appreciated. Thank you!


Solution

  • You can indentify the ones to remove using a non-equi join, and then anti-join on those:

    anti_join(
      df_ASV.Tidy, 
      inner_join(
        df_ASV.Tidy %>% filter(substr(ID,1,1)=="S"),
        df_ASV.Tidy %>% filter(substr(ID,1,1)=="C"),
        join_by(ASV, Batch, x$Counts<y$Counts)
      ) %>% distinct(ASV, Batch, ID =ID.x)
    )
    

    Output:

       ASV        ID  Type Batch Counts
    1 ASV1 Control23  Con1    B1   5000
    2 ASV2 Control23  Con1    B1   2000
    3 ASV2  Sample21     S    B1   6000
    4 ASV1  Sample28     S    B1  10000
    5 ASV2 Control25 Con25    B2  25000
    6 ASV3 Control25 Con25    B2   9000
    7 ASV2  Sample12     S    B2  70000
    8 ASV3  Sample12     S    B2  26000
    

    This removes the following:

    • ASV1, Sample21 (because less than Control 23 in the same batch/species
    • ASV2, Sample10 (because less than Control 25 in the same batch/species

    An alternative approach is to identify the samples you want to retain using the non-equi join, and then row bind with the controls

    bind_rows(
      inner_join(
        df_ASV.Tidy %>% filter(substr(ID,1,1)=="S"),
        df_ASV.Tidy %>% filter(substr(ID,1,1)=="C"),
        join_by(ASV, Batch, x$Counts>y$Counts)
      ) %>% select(ASV, ID=ID.x, Type=Type.x, Batch, Counts = Counts.x),
      df_ASV.Tidy %>% filter(substr(ID,1,1) == "C")
    )
    

    Result is equivalent.