Search code examples
rgroup-bydplyrconditional-statementssummarize

count distinct levels of a data frame for groups based on a condition


I have the following DF

x = data.frame('grp' = c(1,1,1,2,2,2),'a' = c(1,2,1,1,2,1), 'b'= c(6,5,6,6,2,6), 'c' = c(0.1,0.2,0.4,-1, 0.9,0.7))

  grp a b    c
1   1 1 6  0.1
2   1 2 5  0.2
3   1 1 6  0.4
4   2 1 6 -1.0
5   2 2 2  0.9
6   2 1 6  0.7 

I want to count distinct levels of (a,b) for each group where c >= 0.1

I have tried using dplyr for this using group_by & summarise but not getting the desired result

x %>% group_by(grp) %>% summarise(count = n_distinct(c(a,b)[c >= 0.1]))

For the above case I would expect the following result

    grp count
  <dbl> <int>
1     1     2
2     2     2

However using the above query I am getting the following result

    grp count
  <dbl> <int>
1     1     4
2     2     3

Logically the above output seems to be solving for all unique values of a concat list of (a,b) but not what I require Any pointers, really appreciate any help


Solution

  • We can paste a and b columns and count distinct values in each group.

    library(dplyr)
    
    x %>% 
      mutate(col = paste(a, b, sep = "_")) %>%
      group_by(grp) %>%
      summarise(count = n_distinct(col[c >= 0.1]))
    
    #    grp count
    #  <dbl> <int>
    #1     1     2
    #2     2     2