Search code examples
rgroup-bydplyrdistinct-values

R group_by and count distinct values in dataframe column with condition, using mutate


R group_by and count distinct values in dataframe column with condition, using mutate

This is similar to the question R group by | count distinct values grouping by another column but slightly different in that i want to mutate together with a condition. to count distinct values in each group, but also present in the vector c

id = c(  1,  1,  2,  2,  3,  4,  4,  5,  5,  5,  6,  6,  6,  6 )
val = c(100,100,200,300,400,500,500,500,600,600,200,200,300,500)
df <- data.frame(id=id, val=val)
c= c(200,500)

I know the following code is wrong, but it gives and idea of what i am trying to do. So, I want to create a fourth column, by grouping df by id, where the count of distinct values is given when it is present in c

x <- df %>% group_by(id) %>% 
  mutate(distinctValues = n_distinct(val %in% c ))

If we consider the id = 6, The distict values present in it are 200,300 and 500. But 300 is not present in c. So the count is = 2.

so the output column will be c(0,0,1,1,0,1,1,1,1,1,2,2,2,2)


Solution

  • Since c is unique, you can approach it from the other way - count the number of c values that show up in val.

    df %>% 
      group_by(id) %>% 
      mutate(distinctValues = sum(c %in% val))
    # # A tibble: 14 x 3
    # # Groups:   id [6]
    #       id   val distinctValues
    #    <dbl> <dbl>          <int>
    #  1     1   100              0
    #  2     1   100              0
    #  3     2   200              1
    #  4     2   300              1
    #  5     3   400              0
    #  6     4   500              1
    #  7     4   500              1
    #  8     5   500              1
    #  9     5   600              1
    # 10     5   600              1
    # 11     6   200              2
    # 12     6   200              2
    # 13     6   300              2
    # 14     6   500              2
    

    You could also use distinctValues = sum(unique(val) %in% c) if that seems clearer - it might be a tad less efficient, but not enough to matter unless your data is massive.