Search code examples
rdplyrfilteringunique

how to determine the number of unique values based on multiple criteria dplyr


I've got a df that looks like:

df(site=c(A,B,C,D,E), species=c(1,2,3,4), Year=c(1980:2010).

I would like to calculate the number of different years that each species appear in each site, creating a new column called nYear, I've tried filtering by group and using mutate combined with ndistinct values but it is not quite working.

Here is part of the code I have been using:

Df1 <- Df %>%
  filter(Year>1985)%>%
  mutate(nYear = n_distinct(Year[Year %in% site]))%>%
  group_by(Species,Site, Year) %>% 
  arrange(Species, .by_group=TRUE) 
  ungroup()

Solution

  • The approach is good, a few things to correct.

    First, let's make some reproducible data (your code gave errors).

    df <- data.frame("site"=LETTERS[1:5], "species"=1:5, "Year"=1981:2010)
    

    You should have used summarise instead of mutate when you're looking to summarise values across groups. It will give you a shortened tibble as an output, with only the groups and the summary figures present (fewer columns and rows).

    mutate on the other hand aims to modify an existing tibble, keeping all rows and columns by default.

    The order of your functions in the chains also needs to change.

    df %>%
      filter(Year>1985) %>%
      group_by(species,site) %>% 
      summarise(nYear = length(unique(Year))) %>% # instead of mutate
      arrange(species, .by_group=TRUE) %>% 
    ungroup()
    

    First, group_by(species,site), not year, then summarise and arrange.

    # A tibble: 5 × 3
      species site  nYear
        <int> <chr> <int>
    1       1 A         5
    2       2 B         5
    3       3 C         5
    4       4 D         5
    5       5 E         5