Search code examples
rdplyrgroupingdata-cleaning

Group by and summarize conditionally


My data table df shows employees, their employers, and whether they are hybrid workers. It looks like this:

name firm hybrid
Mike A    1
Kate A    1 
Dan  C    0
Jess B    0
Mick C    1
Nate C    0 
Mike A    1

I want to have a table that shows the number of employees and number of hybrid employees by firm:

firm total hybrid
A    3     3
B    1     0
C    3     1

What I tried so far is group_by (firm) %>% summarize (total = length(unique(name)), hybrid = length(unique(df[hybrid == 1]))) However, this approach gave me one observation per employee instead of firm. Does anyone know a better way to do this?


Solution

  • You need to use name[hybrid == 1] instead of df[hybrid == 1]. Note that dplyr has a wrapper function for length(unique(x)), n_distinct.

    library(dplyr)
    df %>% 
      group_by(firm) %>% 
      summarise(total = n_distinct(name),
                hybrid = n_distinct(name[hybrid == 1]))
    

    output

    # A tibble: 3 × 3
      firm  total hybrid
      <chr> <int>  <int>
    1 A         2      2
    2 B         1      0
    3 C         3      1