Search code examples
riterationmutate

In R, how can I get the sum for all combinations of two variables?


I have a long dataset with students grades and subjects. I want to keep a long dataset, but I want to add a column that tells me how many Fs a student had in their humanity courses (English & History) and their STEM courses (biology & math). I also want the same for Ds, Cs, Bs, and As.

I know I could explicitly spell this out, but in the future, they might have other subjects (like adding Chemistry to STEM) or completely different categories, like Foreign Languages, so I want it to be scalable.

I know how to get all combinations of columns, and I know how to do to each part manually--but I don't know how to combine the two. Any help would be greatly appreciated!

#Sample data
library(tidyverse)

student_grades <- tibble(student_id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5),
                      subject = c(rep(c("english", "biology", "math", "history"), 4), NA, "biology"),
                      grade = c(1, 2, 3, 4, 5, 4, 3, 2, 2, 4, 1, 1, 1, 1, 2, 3, 3, 4))
#All combinations of grades and subjects
all_subject_combos <- c("eng|his", "bio|math")
all_grades <- c("F", "D", "C", 
             "B", "A")

subjects_and_letter_grades <- expand.grid(all_subject_combos, all_grades)

all_combos <- subjects_and_letter_grades %>%
  unite("names", c(Var1, Var2)) %>%
  mutate(names = str_replace_all(names, "\\|", "_")) %>%
  pull(names)
#Manual generation of numbers of Fs by subject
#This is what I want the results to look like, but with all other letter grades

student_grades %>%
  group_by(student_id) %>%
  mutate(eng_his_F = sum((case_when(
    str_detect(subject, "eng|his") & grade == 1 ~ 1, 
    TRUE ~ 0)), na.rm = TRUE),
bio_math_F = sum((case_when(
  str_detect(subject, "bio|math") & grade == 1 ~ 1, 
  TRUE ~ 0)), na.rm = TRUE)) %>%
ungroup()

Ideally, this would be scalable for any number of subject combos and wouldn't require me to write out the same code for Ds, Cs, Bs, and As. Thank you!


Solution

  • We may loop over the all_combos vector with map, then within each list, do the grouping by 'student_id' (could also do this outside the loop and create an object to use this within here), create the new column with same name as looped by evaluating (!!) and using := operator on the sum of the output from case_when and bind the data with the original data

    library(dplyr)
    library(purrr)
    library(stringr)
    map_dfc(all_combos, ~ student_grades %>% 
      group_by(student_id) %>%
      transmute(!! .x := sum(case_when(str_detect(subject,
       str_replace(.x, "(\\w+)_(\\w+)_.", "\\1|\\2")) &
        grade == match(str_extract(.x, ".$"), all_grades)~ 1, TRUE ~ 0))) %>%
      ungroup %>% 
      dplyr::select(-student_id)) %>%
      bind_cols(student_grades, .)
    

    -output

    # A tibble: 18 × 13
       student_id subject grade eng_his_F bio_math_F eng_his_D bio_math_D eng_his_C bio_math_C eng_his_B bio_math_B eng_hi…¹ bio_m…²
            <dbl> <chr>   <dbl>     <dbl>      <dbl>     <dbl>      <dbl>     <dbl>      <dbl>     <dbl>      <dbl>    <dbl>   <dbl>
     1          1 english     1         1          0         0          1         0          1         1          0        0       0
     2          1 biology     2         1          0         0          1         0          1         1          0        0       0
     3          1 math        3         1          0         0          1         0          1         1          0        0       0
     4          1 history     4         1          0         0          1         0          1         1          0        0       0
     5          2 english     5         0          0         1          0         0          1         0          1        1       0
     6          2 biology     4         0          0         1          0         0          1         0          1        1       0
     7          2 math        3         0          0         1          0         0          1         0          1        1       0
     8          2 history     2         0          0         1          0         0          1         0          1        1       0
     9          3 english     2         1          1         1          0         0          0         0          1        0       0
    10          3 biology     4         1          1         1          0         0          0         0          1        0       0
    11          3 math        1         1          1         1          0         0          0         0          1        0       0
    12          3 history     1         1          1         1          0         0          0         0          1        0       0
    13          4 english     1         1          1         0          1         1          0         0          0        0       0
    14          4 biology     1         1          1         0          1         1          0         0          0        0       0
    15          4 math        2         1          1         0          1         1          0         0          0        0       0
    16          4 history     3         1          1         0          1         1          0         0          0        0       0
    17          5 <NA>        3         0          0         0          0         0          0         0          1        0       0
    18          5 biology     4         0          0         0          0         0          0         0          1        0       0
    # … with abbreviated variable names ¹​eng_his_A, ²​bio_math_A