Search code examples
rdplyrfilterunique

Cumulative sum of unique values based on multiple criteria


I've got a df with multiple columns containing information of species sightings over the years in different sites, therefore each year might show multiple records. I would like to filter my df and calculate some operations based on certain columns, but I'd like to keep all columns for further analyses. I had some previous code using summarise but as I would like to keep all columns I was trying to avoid using it.

Let's say the columns I'm interested to work with at the moment are as follows:

df <- data.frame("Country" = LETTERS[1:5], "Site"=LETTERS[6:10], "species"=1:5, "Year"=1981:2010)

I would like to calculate:

1- The cumulative sum of the records in which a species has been documented within each site creating a new column "Spsum". 2- The number of different years that each species has been seen on a particular site, this could be done as cumulative sum as well, on a new column "nYear".

For example, if species 1 has been recorded 5 times in 1981, and 2 times in 1982 in Site G, Spsum would show 7 (cumulative sum of records) whereas nYear would show 2 as it was spotted over two different years. So far I've got this, but nYear is displaying 0s as a result.

 Df1 <- df %>%
    filter(Year>1980)%>%
    group_by(Country, Site, Species, Year) %>%
    mutate(nYear = n_distinct(Year[Species %in% Site]))%>%
    ungroup()

Thanks!


Solution

  • This cound help, without the need for a join.

    df %>% arrange(Country, Site, species, Year) %>% 
      filter(Year>1980) %>%
      group_by(Site, species) %>%
      mutate(nYear = length(unique(Year))) %>% 
      mutate(spsum = rowid(species))
    
    # A tibble: 30 x 6
    # Groups:   Site, species [5]
       Country Site  species  Year nYear spsum
       <chr>   <chr>   <int> <int> <int> <int>
     1 A       F           1  1981     6     1
     2 A       F           1  1986     6     2
     3 A       F           1  1991     6     3
     4 A       F           1  1996     6     4
     5 A       F           1  2001     6     5
     6 A       F           1  2006     6     6
     7 B       G           2  1982     6     1
     8 B       G           2  1987     6     2
     9 B       G           2  1992     6     3
    10 B       G           2  1997     6     4
    # ... with 20 more rows