R tibble: Group by column A, keep only distinct combinations of columns B and C, and sum the values in column C


I want to group by column A and then sum the values in column C, counting each distinct combination of columns B and C only once. Is it possible to do this inside the summarise() call? I know it can be done by calling distinct() before the aggregation, but what about doing it in a single summarise() step?

Data:

df <- tibble(A = c(1,1,1,2,2), B = c('a','b','b','a','a'), C=c(5,10,10,15,15))

My attempt, which doesn't work:

df %>% 
group_by(A) %>% 
summarise(sumC=sum(distinct(B,C) %>% select(C)))

Desired output:

A sumC
1 15
2 15

Solution

  • You could use duplicated() inside summarise() to keep only the first row for each value of B within each group:

    df %>%
        group_by(A) %>%
        summarise(sumC = sum(C[!duplicated(B)]))
    ## A tibble: 2 x 2
    #      A  sumC
    #  <dbl> <dbl>
    #1     1    15
    #2     2    15
    

    Or with distinct() applied to the grouped data before summarise():

    df %>%
        group_by(A) %>%
        distinct(B, C) %>%
        summarise(sumC = sum(C))
    ## A tibble: 2 x 2
    #      A  sumC
    #  <dbl> <dbl>
    #1     1    15
    #2     2    15
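
One caveat about the duplicated() version: !duplicated(B) deduplicates on B alone, not on the (B, C) pair. With the sample data both approaches agree, because every repeated B within a group carries the same C. If that cannot be assumed, a minimal sketch of deduplicating on the pair inside summarise() is to pass a two-column data frame to duplicated(). The data set df2 below is hypothetical (not from the question); it adds a row where B repeats with a different C so the difference becomes visible.

```r
library(dplyr)

# Hypothetical data: in group A == 1, B == "b" appears with C == 10 (twice)
# and with C == 20 (once).
df2 <- tibble(
  A = c(1, 1, 1, 1, 2, 2),
  B = c("a", "b", "b", "b", "a", "a"),
  C = c(5, 10, 10, 20, 15, 15)
)

# Deduplicate on the (B, C) pair: duplicated() on a data frame
# compares whole rows, so ("b", 10) and ("b", 20) are both kept.
by_pair <- df2 %>%
  group_by(A) %>%
  summarise(sumC = sum(C[!duplicated(data.frame(B, C))]))

# Deduplicate on B only: ("b", 20) is dropped because "b" was already seen.
by_b_only <- df2 %>%
  group_by(A) %>%
  summarise(sumC = sum(C[!duplicated(B)]))

print(by_pair)    # sumC is 35 for A == 1 and 15 for A == 2
print(by_b_only)  # sumC is 15 for A == 1 and 15 for A == 2
```

The by_pair result matches what group_by(A) %>% distinct(B, C) %>% summarise(sumC = sum(C)) would give on the same data. With dplyr 1.1.0 or later, sum(distinct(pick(B, C))$C) should be another way to write the same thing inside summarise().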