I am trying to filter out the species (ASV) in samples that have lower abundance counts than the same species in controls. Below is a simplified version of my tidy data table:
ASV | ID | Type | Batch | Counts |
---|---|---|---|---|
ASV1 | Control23 | Con1 | B1 | 5000 |
ASV2 | Control23 | Con1 | B1 | 2000 |
ASV1 | Sample21 | S | B1 | 4000 |
ASV2 | Sample21 | S | B1 | 6000 |
ASV1 | Sample28 | S | B1 | 10000 |
ASV2 | Control25 | Con25 | B2 | 25000 |
ASV3 | Control25 | Con25 | B2 | 9000 |
ASV2 | Sample12 | S | B2 | 70000 |
ASV3 | Sample12 | S | B2 | 26000 |
ASV2 | Sample10 | S | B2 | 3100 |
For example, I would want to remove ASV1 in Sample21 from the dataset because its counts are less than the Control23 that was used for Batch B1. Another example: I want to keep ASV2 in Sample12 because its counts are higher than the counts of the same species in Control25 in Batch B2.
I've been using tidyverse
to clean and group my data but I'm stuck as to how to subset the ASV count value for the control in each batch. I thought it would be simple to use filter
as follows:
df_ASV.Tidy.Batch <- df_ASV.Tidy %>%
group_by(Batch, Type, ID, ASV) %>%
filter((Type != "S" & Counts > 0) > (Type != "C" & Counts > 0))
But the output only acknowledged the argument for Type != "S" & Counts > 0
and not Type != "C"
& Counts > 0, and the output was the same even without Type != "C" & Counts > 0
.
I think I would likely need to create a new column by mutate
or summarise
to return TRUE or FALSE and then use filter
, but it returns back to the original problem where I can't figure out how to specifically compare the count values of control vs sample of the same species and batch.
Any advice would be greatly appreciated. Thank you!
You can indentify the ones to remove using a non-equi join, and then anti-join on those:
anti_join(
df_ASV.Tidy,
inner_join(
df_ASV.Tidy %>% filter(substr(ID,1,1)=="S"),
df_ASV.Tidy %>% filter(substr(ID,1,1)=="C"),
join_by(ASV, Batch, x$Counts<y$Counts)
) %>% distinct(ASV, Batch, ID =ID.x)
)
Output:
ASV ID Type Batch Counts
1 ASV1 Control23 Con1 B1 5000
2 ASV2 Control23 Con1 B1 2000
3 ASV2 Sample21 S B1 6000
4 ASV1 Sample28 S B1 10000
5 ASV2 Control25 Con25 B2 25000
6 ASV3 Control25 Con25 B2 9000
7 ASV2 Sample12 S B2 70000
8 ASV3 Sample12 S B2 26000
This removes the following:
An alternative approach is to identify the samples you want to retain using the non-equi join, and then row bind with the controls
bind_rows(
inner_join(
df_ASV.Tidy %>% filter(substr(ID,1,1)=="S"),
df_ASV.Tidy %>% filter(substr(ID,1,1)=="C"),
join_by(ASV, Batch, x$Counts>y$Counts)
) %>% select(ASV, ID=ID.x, Type=Type.x, Batch, Counts = Counts.x),
df_ASV.Tidy %>% filter(substr(ID,1,1) == "C")
)
Result is equivalent.