r dplyr summarization

How to group_by(x) and summarise by counting distinct(y) for each x level?


I have the following situation:

V1 V2
A A1
A A1
A A1
A A2
A A2
A A3
B B1
B B2
B B2

I need to group by V1 and count how many distinct V2 values each V1 level has. Something like this:

V1 n
A 3
B 2

How can I use dplyr functions to solve that?

Thanks!!


Solution

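  • We can use rle after grouping by 'V1'. Note that rle counts runs of consecutive equal values, so it matches the number of distinct V2 values only when equal values are adjacent within each group, as they are in the question's table.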

    library(dplyr)
    df1 %>%
       group_by(V1) %>%
       summarise(n = length(rle(V2)$values), .groups = 'drop')
    

    -output

    # A tibble: 2 × 2
      V1        n
      <chr> <int>
    1 A         3
    2 B         2
    

    Or with rleid (from data.table) and n_distinct, which counts the same runs of consecutive equal values

    library(data.table)
    df1 %>% 
      group_by(V1) %>% 
      summarise(n = n_distinct(rleid(V2)))
    # A tibble: 2 × 2
      V1        n
      <chr> <int>
    1 A         3
    2 B         2
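
Since the question asks for distinct values rather than runs, calling n_distinct directly on V2 may be the more direct option, and it does not depend on row order. The following is a minimal sketch, assuming the question's table is rebuilt as a data frame. The names df, df_unsorted, res and res_cmp are illustrative and not from the original post, and df_unsorted is a hypothetical variant added only to show where the run-based count and the distinct count diverge.

```r
library(dplyr)

# Same layout as the question's table (names here are illustrative).
df <- data.frame(
  V1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  V2 = c("A1", "A1", "A1", "A2", "A2", "A3", "B1", "B2", "B2"),
  stringsAsFactors = FALSE
)

# Count distinct V2 values per V1 level, regardless of row order.
res <- df %>%
  group_by(V1) %>%
  summarise(n = n_distinct(V2), .groups = "drop")
# A has 3 distinct values, B has 2.

# Hypothetical variant: "A1" reappears after "A2" in group A.
df_unsorted <- df
df_unsorted$V2[6] <- "A1"

res_cmp <- df_unsorted %>%
  group_by(V1) %>%
  summarise(
    n_runs   = length(rle(V2)$values),  # runs of consecutive equal values
    n_unique = n_distinct(V2),          # distinct values
    .groups  = "drop"
  )
# For A, n_runs is 3 (A1, A2, A1) while n_unique is 2.
```

If the rows are sorted by V1 and V2 first, the run-based and distinct counts agree, so the choice mostly matters for unsorted data.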
    

    data

    df1 <- structure(list(V1 = c("A", "A", "A", "A", "A", "A", "B", "B", 
    "B"), V2 = c("A1", "A1", "A1", "A2", "A2", "A3", "B1", "B2", 
    "B2")), class = "data.frame", row.names = c(NA, -9L))