Working in R on genomic data.
I'm trying to subset a very large melted phyloseq table, which includes a column of phylum IDs, in order to remove rows containing phyla that occur less than 100000 times in the table. I might have missed an "easy" way to do this, but I eventually ended up trying to make my own function.
The function:
phylum_subset <- function(x = melt.ALKSS_few, #melted physeq object
Count = melt.ALKSS$Phylum, #counting phyla
Value = 1000 #minimum number of OTUs
){
phyla.table <- table(x$Count)
for(Count in x){if(phyla.table[Count]<=100000)
subset(x,Phylum != Count)
}
}
I will grant that this is my first time writing a function and I don't really know what I'm doing.
My function input and resulting error output ends up like so:
melt.ALKSS_few.count <- phylum_subset(x = melt.ALKSS_few,Count = melt.ALKSS_few$Phylum,Value = 100000)
Error in if (phyla.table[Count] <= 1e+05) subset(x, Count != Phylum_col) :
the condition has length > 1
Because I'm trying to subset by a sum of occurences in a column, across all occurences in that column, I couldn't just use filter() or something once (unless I wanted to do that 500 times). Surely someone has done something like this before?
Edit: OK, trying to provide a reproducible chunk of my dataset. Be warned, it's got over 808k obs of 47 variables because doing genomics on an ecological dataset is a mess. I've removed some variables that are remnants of metadata for previous steps (primer sequences, etc.) that I won't be using in analysis just to keep the code block... less massive.
> dput(droplevels(head(melt.ALKSS_few)))
structure(list(OTU = c("44c21e29adae97a53247abbd73978395", "0f18144d308ada95632ab5193d92073f",
"d829bee4984f82ffc2453212157caf96", "0f18144d308ada95632ab5193d92073f",
"0ddcd311e02f742e2e0e61ce02cf9c29", "120eba657e42a11a5c29f97b90f02035"
), Sample = c("S438", "S680", "S437", "S345", "S454", "S513"),
Abundance = c(10755, 9568, 8186, 7621, 7506, 7501), BarcodeSequence = c("CATTTTAGGACT",
"CGGAATAGAGTA", "CATTTTAGAGTA", "TATAATGGACCA", "CGGAATTGGCAT",
"GACGACGGACCA"), PrimerDesc = c("16S",
"16S", "16S", "16S", "16S", "16S"), SampleName = c("06222021KC-2-R",
"09292021KC-2-R", "06222021KC-1-R", "06032021KC-1-R", "06292921KC-3-R",
"06302021KC-3-R"), Project = c("16SLBSKR1-", "16SLBSKR2-",
"16SLBSKR1-", "16SLBSKR1-", "16SLBSKR1-", "16SLBSKR2-"),
Number = c("456", "694", "455", "363", "471", "491"), Date = c("6_22_2021", "9_29_2021", "6_22_2021", "6_3_2021",
"6_29_2021", "6_30_2021"), Year = c(2021L, 2021L, 2021L,
2021L, 2021L, 2021L), Season = c("Summer", "Fall", "Summer",
"Summer", "Summer", "Summer"), sample_Species = c("Little_Bluestem",
"Little_Bluestem", "Little_Bluestem", "Little_Bluestem",
"Little_Bluestem", "Little_Bluestem"), SoloOrMixed = c("Solo",
"Mixed", "Solo", "Mixed", "Mixed", "Solo"), Location = c("Tyler_SP", "Hy_180", "Tyler_SP", "Roadside_Hy67",
"Copper_Breaks_SP", "Caprock_Canyons_SP"), Ecoregion = c("South_Central_Plains",
"South_Central_Plains", "South_Central_Plains", "Edwards_Plateau",
"Southwestern_Tablelands", "Southwestern_Tablelands"), Habitat = c("Forest",
"Roadside", "Forest", "Roadside", "Roadside", "AridRock"),
Source = c("Root", "Root", "Root", "Root", "Root", "Root"
), PrecipMonth = c(96.65,
37.45, 96.65, 125.94, 125.01, 153.94), PrecipDaysSince = c(1L,
1L, 1L, 1L, 1L, 0L), pH = c(6.8, 6.7, 6.8, 8, 8, 7.8), EC = c(139L,
182L, 139L, 161L, 125L, 2370L), NO3 = c(0, 4.4, 0, 0.2, 2.2,
3.4), P = c(16L, 17L, 16L, 14L, 5L, 6L), K = c(145L, 84L,
145L, 114L, 160L, 65L), Ca = c(3918L, 2159L, 3918L, 27256L,
6609L, 16508L), Mg = c(166L, 130L, 166L, 188L, 148L, 95L),
S = c(10L, 16L, 10L, 24L, 24L, 14299L), Na = c(4L, 3L, 4L,
4L, 4L, 4L), Fe = c(19.76, 17, 19.76, 2.31, 1, 0), Zn = c(2.28,
15.1, 2.28, 7.01, 0.8, 0.1), Mn = c(64.16, 19, 64.16, 27.01,
15, 6), Cu = c(0.16, 0.2, 0.16, 0.16, 0.2, 0.2), Kingdom = c("d__Bacteria",
"d__Bacteria", "d__Bacteria", "d__Bacteria", "d__Bacteria",
"d__Bacteria"), Phylum = c("Proteobacteria", "Proteobacteria",
"Proteobacteria", "Proteobacteria", "Proteobacteria", "Actinobacteriota"
), Class = c("Gammaproteobacteria", "Gammaproteobacteria",
"Alphaproteobacteria", "Gammaproteobacteria", "Gammaproteobacteria",
"Actinobacteria"), Order = c("Xanthomonadales", "Pseudomonadales",
"Rhizobiales", "Pseudomonadales", "Pseudomonadales", "Streptomycetales"
), Family = c("Rhodanobacteraceae", "Pseudomonadaceae", "Xanthobacteraceae",
"Pseudomonadaceae", "Pseudomonadaceae", "Streptomycetaceae"
), Genus = c("Rhodanobacter", "Pseudomonas", "Bradyrhizobium",
"Pseudomonas", "Pseudomonas", "Streptomyces"), Species = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_)), row.names = c(2352002L, 511171L, 7348565L,
510815L, 468295L, 621043L), class = "data.frame")
See below for a tidyverse
solution. Note that I've used GlobalPatterns
from phyloseq
to create a reproducible example.
require("phyloseq")
require("tidyverse")
# Load the data and melt it
data(GlobalPatterns)
psdf <- psmelt(GlobalPatterns)
# Function to subset a dataframe based on the size of each group
# in a grouping variable
subset_by_freq <- function(df, grouping_var, threshold){
df %>%
group_by(!!sym(grouping_var)) %>%
filter(n() >= threshold) %>%
ungroup()
}
# Filter out taxa with less than 1e5 counts
psdf_sub <- subset_by_freq(psdf, "Phylum", 1e5)
# Sanity check: count the number of rows per taxon
psdf_sub %>%
group_by(Phylum) %>%
tally()
#> # A tibble: 2 x 2
#> Phylum n
#> <chr> <int>
#> 1 Firmicutes 113256
#> 2 Proteobacteria 166816
Created on 2022-08-12 by the reprex package (v2.0.1)