Search code examples
dplyrmaxunique

Summarise without reducing the number of rows in the data.frame


I was trying to come up with a dplyr solution of extracting the max number of observations of a species within a month and a particular year.

Sp<- data.frame(Species = c("A","A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","C","C","C","C","C","C","C","C","C","C","C","C","C","C","C","C","C","C","C","C","C","C"),
Month= c("2","3","4","4","4","7","3","4","4","5","6","4","5","6","6","7","4","3","4","5","6","7","3","4","4","5","6","4","5","6","4","5","6","2","2","8","5","5"),
Year= c("2018","2005","2006","2006","2006","2006","2008","2008","2008","2008","2018","2009","2009","2009","2018","2009","2009","2009","2009","2009","2018","2005","2005","2005","2006","2006","2005","2006","2005","2005","2005","2009","2009","2009","2018","2018","2018","2018"))

For the above chunk of data I'd be interested in getting to know the max number of times in which each species were observed per month each year, and display that information on a new column.

So let's say that for example there were 5 records of Species A in month 4 in 2008. Then my new column would display 5,5,5,5,5... and so on. I've tried different combinations of cumsum, rowid, lenght(unique) but I never get to the expected outcome of that column just displaying max values all along for those groups.

It is important that I keep all my columns too, so summarise wouldn't be an option.

Thanks!


Solution

  • It sounds like you want to produce a summary without reducing the number of rows.

    So:

    Sp %>%
      group_by(Species, Month, Year) %>%
      summarise(num = n(), .groups = "drop")
    

    Makes the numbers you need, but reduces the number of rows down to one row per distinct combination of the grouping varaibles.

    In this case you can use mutate:

    Sp %>%
      group_by(Species, Month, Year) %>%
      mutate(num = n()) %>%
      ungroup()
    

    The same numbers are produced, but they appear in each row to which they correspond. So the output contains the same number of rows as the input.

    If you then wanted the maximum observations for a species in any month of that year, we could repeat this approach:

    Sp %>%
      group_by(Species, Month, Year) %>%
      mutate(num = n()) %>%
      group_by(Sepcies, Year) %>%
      mutate(max = max(num)) %>%
      ungroup()