Tags: r, dplyr, data-management

Select columns from a dataframe where all samples in at least one group are nonzero


I have a dataframe with samples as rows and species as columns, and a column in another dataframe that codes the samples into groups. I want to select every column for which at least one group has a nonzero value in all of its samples.

species frame (referred to as `species` below):

structure(list(Otu000132 = c(0L, 56L, 30L, 52L, 1L, 4L, 31L, 4L, 17L, 9L, 4L), 
               Otu000144 = c(191L, 14L, 58L, 137L, 127L, 222L, 26L, 175L, 133L, 107L, 43L),
               Otu000146 = c(0L, 0L, 0L, 0L, 16L, 62L, 41L, 16L, 60L, 32L, 0L), 
               Otu000147 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
               Otu000151 = c(2L, 9L, 4L, 1L, 0L, 4L, 4L, 2L, 3L, 0L, 0L),
               Otu000162 = c(2L, 1L, 0L, 0L, 1L, 1L, 0L, 2L, 1L, 0L, 0L), 
               Otu000164 = c(2L, 0L, 1L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
               Otu000174 = c(0L, 0L, 3L, 1L, 0L, 2L, 0L, 1L, 2L, 1L, 0L), 
               Otu000176 = c(1L, 9L, 0L, 1L, 2L, 5L, 3L, 3L, 8L, 2L, 2L), 
               Otu000186 = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L),
               Otu000190 = c(1L, 1L, 1L, 0L, 0L, 5L, 1L, 2L, 7L, 0L, 0L)),
          .Names = c("Otu000132", "Otu000144", "Otu000146", "Otu000147", 
                     "Otu000151", "Otu000162", "Otu000164", "Otu000174", 
                     "Otu000176", "Otu000186", "Otu000190"),
          row.names = 30:40, class = "data.frame")

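grouping frame (referred to as `grouping` below; a two-column matrix with the sample ID first and the group code second):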

structure(c(30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
            40, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), 
          .Dim = c(11L, 2L))

desired output:

structure(list(Otu000132 = c(0L, 56L, 30L, 52L, 1L, 4L, 31L, 4L, 17L, 9L, 4L), 
               Otu000144 = c(191L, 14L, 58L, 137L, 127L, 222L, 26L, 175L, 133L, 107L, 43L), 
               Otu000151 = c(2L, 9L, 4L, 1L, 0L, 4L, 4L, 2L, 3L, 0L, 0L), 
               Otu000176 = c(1L, 9L, 0L, 1L, 2L, 5L, 3L, 3L, 8L, 2L, 2L),
               Otu000190 = c(1L, 1L, 1L, 0L, 0L, 5L, 1L, 2L, 7L, 0L, 0L)), 
          .Names = c("Otu000132", "Otu000144",  "Otu000151", 
                     "Otu000176", "Otu000190"),
          row.names = 30:40, class = "data.frame")

I feel like this should be something that I could do with dplyr select, but I can't figure it out. Anyone have suggestions for starting me on a path?


Solution

  • This can indeed be done with dplyr, and in a fairly straightforward way. As others have pointed out, "Otu000146" does not meet your described criteria (group 1 is all zeros, and groups 2 and 3 each contain a zero), so it would not be included in the final column selection.

    library(dplyr)
    library(tidyr)
    
    # assumes the question's two objects are stored as `species` and `grouping`
    df.species <- cbind(species, group = grouping[, 2]) %>% # add the group code as a column
        gather(variable, value, -group) %>%  # reshape to long format: one row per sample/species pair
        group_by(variable, group) %>% # one group per species column and sample group
        summarize(keep = all(value != 0)) %>% # TRUE where every sample in the group is non-zero
        ungroup() %>% group_by(variable) %>%  # regroup by species column only
        summarize(keep = any(keep)) %>%  # TRUE where at least one group met the criterion
        dplyr::filter(keep) # keep only the qualifying columns
    
       variable  keep
          <chr> <lgl>
    1 Otu000132  TRUE
    2 Otu000144  TRUE
    3 Otu000151  TRUE
    4 Otu000176  TRUE
    5 Otu000190  TRUE
    
    # retrieve only the matching columns
    df.desired <- species[df.species$variable]
    
       Otu000132 Otu000144 Otu000151 Otu000176 Otu000190
    30         0       191         2         1         1
    31        56        14         9         9         1
    32        30        58         4         0         1
    33        52       137         1         1         0
    34         1       127         0         2         0
    35         4       222         4         5         5
    36        31        26         4         3         1
    37         4       175         2         3         2
    38        17       133         3         8         7
    39         9       107         0         2         0
    40         4        43         0         2         0
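
Since the question asked specifically about dplyr's select, here is a minimal sketch of a more direct route. It is not part of the approach above: it assumes dplyr 1.0.0 or later (for `where()` inside `select()`), and the helper name `any_group_all_nonzero` and the vector `grp` are made up for illustration. The data are rebuilt from the question so the block runs on its own.

```r
library(dplyr)

# Data from the question, rebuilt so the example is self-contained
species <- data.frame(
  Otu000132 = c(0L, 56L, 30L, 52L, 1L, 4L, 31L, 4L, 17L, 9L, 4L),
  Otu000144 = c(191L, 14L, 58L, 137L, 127L, 222L, 26L, 175L, 133L, 107L, 43L),
  Otu000146 = c(0L, 0L, 0L, 0L, 16L, 62L, 41L, 16L, 60L, 32L, 0L),
  Otu000147 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  Otu000151 = c(2L, 9L, 4L, 1L, 0L, 4L, 4L, 2L, 3L, 0L, 0L),
  Otu000162 = c(2L, 1L, 0L, 0L, 1L, 1L, 0L, 2L, 1L, 0L, 0L),
  Otu000164 = c(2L, 0L, 1L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  Otu000174 = c(0L, 0L, 3L, 1L, 0L, 2L, 0L, 1L, 2L, 1L, 0L),
  Otu000176 = c(1L, 9L, 0L, 1L, 2L, 5L, 3L, 3L, 8L, 2L, 2L),
  Otu000186 = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L),
  Otu000190 = c(1L, 1L, 1L, 0L, 0L, 5L, 1L, 2L, 7L, 0L, 0L),
  row.names = 30:40
)
grouping <- matrix(c(30:40, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), ncol = 2)

# Group code for each sample, in the same row order as `species`
grp <- grouping[, 2]

# Hypothetical helper: TRUE if at least one group has only non-zero values in x
any_group_all_nonzero <- function(x, g) {
  any(tapply(x != 0, g, all))
}

# Keep the columns for which the helper returns TRUE
df.desired <- species %>%
  select(where(function(x) any_group_all_nonzero(x, grp)))

names(df.desired)
```

This relies on the rows of `species` and `grouping` being in the same order, as they are in the question. If the order could differ, the group codes would need to be matched to the samples by ID first.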