Search code examples
rdataframeaggregatesummarize

How to aggregate R dataframe of two columns based on values of another


My dataframe is as follows in which gender=="1" refers to men and gender=="2" refers to women, Occupations go from A to U and year goes from 2010 to 2018 (I give you a small example)

Gender   Occupation    Year
1            A         2010
1            A         2010
2            A         2010
1            B         2010
2            B         2010
1            A         2011
2            A         2011
1            C         2011
2            C         2011

I want an output that sums the number of rows in which gender and year and occupation is distinct like you can see next:

Year | Occupation | Men | Woman
2010 |      A     |  2  |   1
2010 |      B     |  1  |   1
2011 |      A     |  1  |   1
2011 |      C     |  1  |   1

I have tried the following:

Nr_gender_occupation <- data %>%
   group_by(year, occupation) %>%
   summarise(
      Men = aggregate(gender=="1" ~ occupation, FUN= count),
      Women = aggregate(gender=="2" ~ occupation, FUN=count)
)

Solution

  • You can also do a count within your groups:

    library(dplyr)
    
    df %>% 
      group_by(Occupation, Year) %>% 
      summarize(Men = sum(Gender == 1),
                Woman = sum(Gender == 2), .groups = "drop")
    

    Output

      Occupation  Year   Men Woman
      <chr>      <dbl> <int> <int>
    1 A           2010     2     1
    2 A           2011     1     1
    3 B           2010     1     1
    4 C           2011     1     1