I have a dataset with multiple samples (columns) and variables (rows). I want to filter the dataset to find the variables that are unique to a particular set of samples.
This is the sample data frame:
dput(df)
structure(list(Description=c("k__Bacteria;__;__;__;__","k__Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__RB41;f__Ellin6075",
"k__Bacteria;p__Acidobacteria;c__Acidobacteriia;o__Acidobacteriales;f__Koribacteraceae",
"k__Bacteria;p__Acidobacteria;c__DA052;o__Ellin6513;f__", "k__Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__",
"k__Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__",
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__",
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinomycetaceae",
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinopolysporaceae",
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Corynebacteriaceae"
), ADZU.3 = c(2651L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 12L), ADZU.4 = c(2439L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 5L), BEP.3 = c(11452L, 9L, 5L,
0L, 0L, 6L, 14L, 0L, 0L, 83L), BEP.4 = c(4168L, 0L, 0L, 9L, 3L,
0L, 0L, 5L, 6L, 61L), Hya.1 = c(15179L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 94L), Hya.2 = c(4525L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
34L)), row.names = c(NA, 10L), class = "data.frame")
I am using the filter_at() function in dplyr. I have many samples whose names start with different letters (A, B, H, etc.), and I want to find variables that are unique to samples starting with the same letter (for example, the letter B).
The following code currently works as intended:
##code set 1, this code works
df.bep<-filter_at(df,vars(starts_with("A"),starts_with("H")),
all_vars(.==0))
The result of this code is the following, which is what I expect to see:
dput(df.bep)
structure(list(Description = c("k__Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__RB41;f__Ellin6075",
"k__Bacteria;p__Acidobacteria;c__Acidobacteriia;o__Acidobacteriales;f__Koribacteraceae",
"k__Bacteria;p__Acidobacteria;c__DA052;o__Ellin6513;f__", "k__Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__",
"k__Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__",
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__",
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinomycetaceae",
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinopolysporaceae"
), ADZU.3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), ADZU.4 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), BEP.3 = c(9L, 5L, 0L, 0L, 6L, 14L,
0L, 0L), BEP.4 = c(0L, 0L, 9L, 3L, 0L, 0L, 5L, 6L), Hya.1 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), Hya.2 = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L)), row.names = c(NA, -8L), class = "data.frame")
The issue is that for longer datasets with many different samples, specifying every letter for every sample I want to pass to filter_at() gets cumbersome to write.
I modified the script to use -starts_with() to filter the data frame by excluding the samples that start with a specific letter I don't want to filter on (for example, filter on all samples except those that start with the letter B), such as:
df.bep.2<-filter_at(df,vars(-starts_with("B")),all_vars(.==0))
However, this second version doesn't work as intended. I do not get any errors, but I get an empty data frame instead:
dput(df.bep.2)
structure(list(Description = character(0), ADZU.3 = integer(0),
ADZU.4 = integer(0), BEP.3 = integer(0), BEP.4 = integer(0),
Hya.1 = integer(0), Hya.2 = integer(0)), row.names = c(NA,
0L), class = "data.frame")
Is there something additional I need to put in the code when combining filter_at() and -starts_with()?
That means the condition in all_vars() is not met across the columns that do not start with "B". The filter checks every column whose name doesn't start with "B" and keeps only the rows where all of those columns are 0.
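One way to see which columns the predicate is actually applied to is to pass the same selection to select(). The sketch below uses a small toy data frame that only mirrors the layout in the question (the short labels and the reduced set of columns are illustrative assumptions, not the real data):

```r
library(dplyr)

# Toy data frame mirroring the question's layout (labels and values are illustrative)
df <- data.frame(
  Description = c("taxon_a", "taxon_b", "taxon_c"),
  ADZU.3 = c(2651L, 0L, 12L),
  BEP.3  = c(11452L, 9L, 83L),
  Hya.1  = c(15179L, 0L, 94L)
)

# Columns that vars(-starts_with("B")) would hand to the predicate
checked_cols <- names(select(df, -starts_with("B")))
print(checked_cols)

# Comparing a character column to 0 is FALSE for every row
desc_is_zero <- df$Description == 0
print(desc_is_zero)
```

If a non-numeric column shows up in that list, all_vars(. == 0) cannot be TRUE for any row.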
For example, the mtcars dataset will not return anything with this condition:
mtcars %>%
filter_at(vars(-starts_with("q")), all_vars(. == 0))
[1] mpg cyl disp hp drat wt qsec vs am gear carb
<0 rows> (or 0-length row.names)
Unless we add a row with only 0's (although we could have a non-zero value in the qsec column):
mtcars %>%
bind_rows(setNames(rep(0, ncol(.)), names(.))) %>%
filter_at(vars(-starts_with("q")), all_vars(. == 0))
mpg cyl disp hp drat wt qsec vs am gear carb
1 0 0 0 0 0 0 0 0 0 0 0
EDIT: For your specific problem, it is because the column Description does not == 0. There are probably a couple of solutions, but here are two below that should work for you!
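# Option 1: drop Description from the selection explicitly
df %>%
filter_at(vars(-starts_with("B"), -one_of("Description")), all_vars(. == 0))
# Option 2: only test numeric columns whose names don't start with "B"
df %>%
filter_if(sapply(., is.numeric) & !startsWith(names(.), "B"), all_vars(. == 0))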
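The scoped verbs (filter_at(), filter_if()) are superseded in recent dplyr releases, so the same idea can also be sketched with filter() and if_all(). This assumes dplyr 1.0.4 or later, and the toy data frame below is an illustrative stand-in for the real data, not the data from the question:

```r
library(dplyr)

# Toy data frame mirroring the question's layout (labels and values are illustrative)
df <- data.frame(
  Description = c("taxon_a", "taxon_b", "taxon_c"),
  ADZU.3 = c(2651L, 0L, 12L),
  BEP.3  = c(11452L, 9L, 83L),
  Hya.1  = c(15179L, 0L, 94L)
)

# Keep rows where every numeric column NOT starting with "B" equals 0
bep_only <- df %>%
  filter(if_all(where(is.numeric) & !starts_with("B"), ~ .x == 0))
print(bep_only)
```

Restricting the selection with where(is.numeric) means the Description column never reaches the == 0 comparison, so it does not need to be excluded by name.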