Search code examples
rdataframedplyrplyrsummarize

Conditional summing across columns with dplyr


I have a data frame with four habitats sampled over eight months. Ten samples were collected from each habitat each month. The number of individuals for species in each sample was counted. The following code generates a smaller data frame of a similar structure.

# Pseudo data
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2), levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)

df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)

I want to sum the total number of individuals by month, across all species sampled. I'm using ddply (preferred) but I'm open to other suggestions.

The closest I get is to add together the sum of each column, as shown here.

library(plyr)
ddply(df, ~ Month, summarize, tot_by_mon = sum(Species1) + sum(Species2) + sum(Species3))

#   Month tot_by_mon
# 1   Jan         84
# 2   Feb         92
# 3   Mar         67

This works, but I wonder if there is a generic method to handle cases with an "unknown" number of species. That is, the first species always begins in the 4th column but the last species could be in the 10th or 42nd column. I do not want to hard code the actual species names into the summary function. Note that the species names vary widely, such as Doryflav and Pheibica.


Solution

  • Similar to @useR's answer with data.table's melt, you can use tidyr to reshape with gather:

    library(tidyr)
    library(dplyr)
    gather(df, Species, Value, matches("Species")) %>% 
      group_by(Month) %>% summarise(z = sum(Value))
    
    # A tibble: 3 x 2
       Month     z
      <fctr> <int>
    1    Jan    90
    2    Feb    81
    3    Mar    70
    

    If you know the columns by position instead of a pattern to be "matched"...

    gather(df, Species, Value, -(1:3)) %>% 
      group_by(Month) %>% summarise(z = sum(Value))
    

    (Results shown using @akrun's set.seed(123) example data.)