Search code examples
rdplyrdistinct-valuescumulative-sum

Using dplyr to summarise a running total of distinct factors


I'm trying to generate a species saturation curve for a camera trapping survey. I have thousands of observations and do most of my manipulations in dplyr.

I have three field sites, with observation records of different animal species from a number of weeks of trapping. In some weeks there are no animals, in other weeks there may be more than one species. I want to generate a separate figure for each site to compare how quickly new species that are encountered over the sequential weeks of the study. These observations of new species should eventually saturate once the total species diversity has been captured in the area. Some field sites are likely to saturate faster than others.

The problem is that I have not come across a way of counting the number of distinct species to provide a running total by time. A simple dummy dataset is below.

field_site<-c(rep("A",4),rep("B",4),rep("C",4))
week<-c(1,2,2,3,2,3,4,4,1,2,3,4)
animal<-c("dog","dog","cat","rabbit","dog","dog","dog","rabbit","cat","cat","rabbit","dog")
df<-as.data.frame(cbind(field_site,week,animal),head=TRUE)

I can easily generate the number of unique species within each week grouping, e.g.

tbl_df(df)%>%
  group_by(field_site,week) %>%
  summarise(no_of_sp=n_distinct(animal))

But this is not sensitive to the fact that some species are encountered again in subsequent weeks. What I really need is a running count of the different species that counts the unique species per site from week 1 going down through the rows, assuming that the data is sorted by increasing time from the start of the survey.

The cumulative total of species encountered over the course of the study by week in the example for field Site A would be: week 1 = 1 species, week 2 = 2 species, week 3 = 3 species, week 4 = still 3 species.

For site B cumulative total of species would be: week 1 = 0 species, week 2 = 1 species, week 3 = 1 species,week 4 = 1 species, etc...

Any advice would be greatly appreciated. cheers in advance!


Solution

  • I'm making two assumptions:

    1. Site B, week 4 = 2 species, both "dog" and "rabbit"; and
    2. All sites share the same weeks, so if at least on site has week 4, then all sites should include it. This only drives the mt (empty) variable, feel free to update this variable.

    I first suggest an "empty" data.frame to ensure sites have the requisite week numbers populated:

    mt <- expand.grid(field_site = unique(ret$field_site),
                      week = unique(ret$week))
    

    The use of tidyr helps:

    library(tidyr)
    
    df %>%
      mutate(fake = TRUE) %>%
      # ensure all species are "represented" on each row
      spread(animal, fake) %>%
      # ensure all weeks are shown, even if no species
      full_join(mt, by = c("field_site", "week")) %>%
      # ensure the presence of a species persists at a site
      arrange(week) %>%
      group_by(field_site) %>%
      mutate_if(is.logical, funs(cummax(!is.na(.)))) %>%
      ungroup() %>%
      # helps to contain variable number of species columns in one place
      nest(-field_site, -week, .key = "species") %>%
      group_by(field_site, week) %>%
      # could also use purrr::map in place of sapply
      mutate(n = sapply(species, sum)) %>%
      ungroup() %>%
      select(-species) %>%
      arrange(field_site, week)
    # # A tibble: 12 × 3
    #    field_site   week     n
    #        <fctr> <fctr> <int>
    # 1           A      1     1
    # 2           A      2     2
    # 3           A      3     3
    # 4           A      4     3
    # 5           B      1     0
    # 6           B      2     1
    # 7           B      3     1
    # 8           B      4     2
    # 9           C      1     1
    # 10          C      2     1
    # 11          C      3     2
    # 12          C      4     3