Search code examples
rdplyraggregateplyrmelt

Managing duplicate entries in one column while keeping other columns intact in r


I have count data of species at multiple sites with information of site, year, month and day. Within this data, for several species there are multiple entries on certain days. For example, on 3rd of January, 2016, for species A1 there are two entries i.e. 10 and 20. At first step, I want to take the maximum value of this day i.e. 20. In the second step, if there are more than one day sampled per month per site then I want to take the mean per month. Example is given below

species site    year    month   day total
A1  GG  2016    1   3   10
A1  GG  2016    1   3   20
A1  GG  2016    1   4   22
A2  GG  2016    1   5   32
A2  GG  2016    1   6   34
A3  GG  2016    1   9   23

should look like something like this

species site    year    month   day total
A1  GG  2016    1   3.5 21
A2  GG  2016    1   5.5 33
A3  GG  2016    1   9   23

Solution

  • We group by the first five columns i.e. 'species', 'site', 'year', 'month', 'day', summarise to get the max of 'total', then group by without the 'day' and get the mean of 'day' and 'total'

    library(dplyr)
    df1 %>%
        group_by_at(names(.)[1:5]) %>% 
        summarise(total = max(total)) %>%
        group_by_at(names(.)[1:4]) %>%
        summarise_all(mean)
    # A tibble: 3 x 6
    # Groups: species, site, year [?]
    #   species site   year month   day total
    #   <chr>   <chr> <int> <int> <dbl> <dbl>
    #1 A1      GG     2016     1  3.50  21.0
    #2 A2      GG     2016     1  5.50  33.0
    #3 A3      GG     2016     1  9.00  23.0