Search code examples
rdataframesummarize

R: Using group_by for all values


I am working with the R programming language.

I have the following dataset:

library(dplyr)

df = structure(list(ethnicity = c("c", "c", "c", "b", "c", "b", "b", 
"b", "c", "a", "b", "b", "a", "b", "c", "a", "c", "c", "a", "a", 
"a", "a", "c", "b", "c", "b", "a", "b", "c", "b", "a", "c", "c", 
"a", "c", "b", "a", "c", "a", "a", "b", "c", "c", "a", "c", "a", 
"c", "b", "a", "b", "a", "a", "c", "a", "b", "a", "a", "c", "a", 
"b", "a", "c", "a", "c", "b", "c", "b", "b", "c", "b", "b", "c", 
"c", "a", "b", "b", "a", "b", "a", "a", "b", "c", "c", "a", "b", 
"a", "b", "a", "c", "c", "b", "c", "a", "b", "b", "c", "b", "a", 
"c", "c"), number_of_degrees = c(3L, 2L, 2L, 3L, 1L, 1L, 3L, 
2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 1L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 
3L, 1L, 3L, 3L, 3L, 1L, 3L, 3L, 2L, 2L, 2L, 3L, 3L, 3L, 2L, 1L, 
2L, 1L, 3L, 3L, 2L, 1L, 3L, 1L, 3L, 2L, 2L, 1L, 3L, 2L, 1L, 3L, 
3L, 3L, 1L, 2L, 2L, 1L, 2L, 3L, 3L, 1L, 2L, 1L, 2L, 3L, 3L, 1L, 
3L, 2L, 1L, 1L, 2L, 3L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 3L, 
1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 3L, 3L, 2L, 1L, 2L)), class = "data.frame", row.names = c(NA, 
-100L))


df %>%
    # Group the data by number_of_degrees
    group_by(number_of_degrees) %>%
    # Calculate the percentage of each ethnicity within each group
    summarize(
        percent_a = mean(ethnicity == "a") * 100,
        percent_b = mean(ethnicity == "b") * 100,
        percent_c = mean(ethnicity == "c") * 100
    )

This produces the following output:

# A tibble: 3 x 4
  number_of_degrees percent_a percent_b percent_c
              <int>     <dbl>     <dbl>     <dbl>
1                 1      33.3      36.7      30  
2                 2      31.6      21.1      47.4
3                 3      34.4      40.6      25  

My Question: Is there a more "compact" way to write this code such that I don't have to manually write "percent_a","percent_b", etc.? This way, it would be much faster and automatically do it for all values of ethnicity.


Solution

  • Probably you can try this base R option (the column names might be a bit different from the desired output)

    > aggregate(. ~ number_of_degrees, df, \(x) proportions(table(x)))
      number_of_degrees ethnicity.a ethnicity.b ethnicity.c
    1                 1   0.3333333   0.3666667   0.3000000
    2                 2   0.3157895   0.2105263   0.4736842
    3                 3   0.3437500   0.4062500   0.2500000
    

    or

    reshape(
        as.data.frame(proportions(table(df), 2)),
        direction = "wide",
        idvar = "number_of_degrees",
        timevar = "ethnicity"
    )
    

    which gives

      number_of_degrees    Freq.a    Freq.b    Freq.c
    1                 1 0.3333333 0.3666667 0.3000000
    4                 2 0.3157895 0.2105263 0.4736842
    7                 3 0.3437500 0.4062500 0.2500000
    

    Or, a less compact option with dplyr (sorry for my limited tidyverse knowledge)

    table(rev(df)) %>%
        proportions(1) %>%
        as.data.frame.matrix() %>%
        rownames_to_column(var = "number_of_degrees") %>%
        mutate(number_of_degrees = as.integer(number_of_degrees))
    

    which gives

      number_of_degrees         a         b         c
    1                 1 0.3333333 0.3666667 0.3000000
    2                 2 0.3157895 0.2105263 0.4736842
    3                 3 0.3437500 0.4062500 0.2500000