Trying to subset a large table using counts of all row values in a single column

Working in R on genomic data.

I'm trying to subset a very large melted phyloseq table, which includes a column of phylum IDs, in order to remove rows containing phyla that occur less than 100000 times in the table. I might have missed an "easy" way to do this, but I eventually ended up trying to make my own function.

The function:

phylum_subset <- function(x = melt.ALKSS_few, #melted physeq object 
                          Count = melt.ALKSS$Phylum, #counting phyla
                          Value = 1000   #minimum number of OTUs
                          ){
  phyla.table <- table(x$Count) 
  
  for(Count in x){if(phyla.table[Count]<=100000)
                        subset(x,Phylum != Count)
  }
}

I will grant that this is my first time writing a function and I don't really know what I'm doing.

My function input and resulting error output ends up like so:

melt.ALKSS_few.count <- phylum_subset(x = melt.ALKSS_few,Count = melt.ALKSS_few$Phylum,Value = 100000)
Error in if (phyla.table[Count] <= 1e+05) subset(x, Count != Phylum_col) : 
  the condition has length > 1

Because I'm trying to subset by a sum of occurences in a column, across all occurences in that column, I couldn't just use filter() or something once (unless I wanted to do that 500 times). Surely someone has done something like this before?

Edit: OK, trying to provide a reproducible chunk of my dataset. Be warned, it's got over 808k obs of 47 variables because doing genomics on an ecological dataset is a mess. I've removed some variables that are remnants of metadata for previous steps (primer sequences, etc.) that I won't be using in analysis just to keep the code block... less massive.

> dput(droplevels(head(melt.ALKSS_few)))
structure(list(OTU = c("44c21e29adae97a53247abbd73978395", "0f18144d308ada95632ab5193d92073f", 
"d829bee4984f82ffc2453212157caf96", "0f18144d308ada95632ab5193d92073f", 
"0ddcd311e02f742e2e0e61ce02cf9c29", "120eba657e42a11a5c29f97b90f02035"
), Sample = c("S438", "S680", "S437", "S345", "S454", "S513"), 
    Abundance = c(10755, 9568, 8186, 7621, 7506, 7501), BarcodeSequence = c("CATTTTAGGACT", 
    "CGGAATAGAGTA", "CATTTTAGAGTA", "TATAATGGACCA", "CGGAATTGGCAT", 
    "GACGACGGACCA"), PrimerDesc = c("16S", 
    "16S", "16S", "16S", "16S", "16S"), SampleName = c("06222021KC-2-R", 
    "09292021KC-2-R", "06222021KC-1-R", "06032021KC-1-R", "06292921KC-3-R", 
    "06302021KC-3-R"), Project = c("16SLBSKR1-", "16SLBSKR2-", 
    "16SLBSKR1-", "16SLBSKR1-", "16SLBSKR1-", "16SLBSKR2-"), 
    Number = c("456", "694", "455", "363", "471", "491"), Date = c("6_22_2021", "9_29_2021", "6_22_2021", "6_3_2021", 
    "6_29_2021", "6_30_2021"), Year = c(2021L, 2021L, 2021L, 
    2021L, 2021L, 2021L), Season = c("Summer", "Fall", "Summer", 
    "Summer", "Summer", "Summer"), sample_Species = c("Little_Bluestem", 
    "Little_Bluestem", "Little_Bluestem", "Little_Bluestem", 
    "Little_Bluestem", "Little_Bluestem"), SoloOrMixed = c("Solo", 
    "Mixed", "Solo", "Mixed", "Mixed", "Solo"), Location = c("Tyler_SP", "Hy_180", "Tyler_SP", "Roadside_Hy67", 
    "Copper_Breaks_SP", "Caprock_Canyons_SP"), Ecoregion = c("South_Central_Plains", 
    "South_Central_Plains", "South_Central_Plains", "Edwards_Plateau", 
    "Southwestern_Tablelands", "Southwestern_Tablelands"), Habitat = c("Forest", 
    "Roadside", "Forest", "Roadside", "Roadside", "AridRock"), 
    Source = c("Root", "Root", "Root", "Root", "Root", "Root"
    ),  PrecipMonth = c(96.65, 
    37.45, 96.65, 125.94, 125.01, 153.94), PrecipDaysSince = c(1L, 
    1L, 1L, 1L, 1L, 0L), pH = c(6.8, 6.7, 6.8, 8, 8, 7.8), EC = c(139L, 
    182L, 139L, 161L, 125L, 2370L), NO3 = c(0, 4.4, 0, 0.2, 2.2, 
    3.4), P = c(16L, 17L, 16L, 14L, 5L, 6L), K = c(145L, 84L, 
    145L, 114L, 160L, 65L), Ca = c(3918L, 2159L, 3918L, 27256L, 
    6609L, 16508L), Mg = c(166L, 130L, 166L, 188L, 148L, 95L), 
    S = c(10L, 16L, 10L, 24L, 24L, 14299L), Na = c(4L, 3L, 4L, 
    4L, 4L, 4L), Fe = c(19.76, 17, 19.76, 2.31, 1, 0), Zn = c(2.28, 
    15.1, 2.28, 7.01, 0.8, 0.1), Mn = c(64.16, 19, 64.16, 27.01, 
    15, 6), Cu = c(0.16, 0.2, 0.16, 0.16, 0.2, 0.2), Kingdom = c("d__Bacteria", 
    "d__Bacteria", "d__Bacteria", "d__Bacteria", "d__Bacteria", 
    "d__Bacteria"), Phylum = c("Proteobacteria", "Proteobacteria", 
    "Proteobacteria", "Proteobacteria", "Proteobacteria", "Actinobacteriota"
    ), Class = c("Gammaproteobacteria", "Gammaproteobacteria", 
    "Alphaproteobacteria", "Gammaproteobacteria", "Gammaproteobacteria", 
    "Actinobacteria"), Order = c("Xanthomonadales", "Pseudomonadales", 
    "Rhizobiales", "Pseudomonadales", "Pseudomonadales", "Streptomycetales"
    ), Family = c("Rhodanobacteraceae", "Pseudomonadaceae", "Xanthobacteraceae", 
    "Pseudomonadaceae", "Pseudomonadaceae", "Streptomycetaceae"
    ), Genus = c("Rhodanobacter", "Pseudomonas", "Bradyrhizobium", 
    "Pseudomonas", "Pseudomonas", "Streptomyces"), Species = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_)), row.names = c(2352002L, 511171L, 7348565L, 
510815L, 468295L, 621043L), class = "data.frame")

Solution

See below for a tidyverse solution. Note that I've used GlobalPatterns from phyloseq to create a reproducible example.

require("phyloseq")
require("tidyverse")

# Load the data and melt it
data(GlobalPatterns)
psdf <- psmelt(GlobalPatterns)

# Function to subset a dataframe based on the size of each group
# in a grouping variable
subset_by_freq <- function(df, grouping_var, threshold){
  df %>%
    group_by(!!sym(grouping_var)) %>%
    filter(n() >= threshold) %>%
    ungroup()
}

# Filter out taxa with less than 1e5 counts
psdf_sub <- subset_by_freq(psdf, "Phylum", 1e5)

# Sanity check: count the number of rows per taxon
psdf_sub %>%
  group_by(Phylum) %>%
  tally()
#> # A tibble: 2 x 2
#>   Phylum              n
#>   <chr>           <int>
#> 1 Firmicutes     113256
#> 2 Proteobacteria 166816

^{Created on 2022-08-12 by the reprex package (v2.0.1)}