Search code examples
rtidyverse

How to summarize the top n values across multiple columns row wise?


In my dataframe, I have multiple columns with student grades. I want to sum the "Quiz" columns (e.g., Quiz1, Quiz2). However, I only want to sum the top 2 values, and ignore the others. I want to create a new column with the total (i.e., the sum of the top 2 values).

One issue is that some students have grades that tie for the top 2 grades in a given row. For example, Aaron has a high score of 42, but then there are two scores that tie for the second highest (i.e., 36).

Data

df <- 
  structure(
  list(
    Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia", 
                "Timothy", "Grant", "Chloe", "Judy", "Justin"),
    ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096, 
           98432, 19487, 94029),
    Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
    Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
    Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
    Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
  ),
  row.names = c(NA, -10L),
  class = c("tbl_df", "tbl", "data.frame")
)

I know that I can use pivot_longer to do this, which allows me to arrange by group, then take the top 2 values for each student. This works fine, but I would like a more efficient way with tidyverse, rather than having to pivot back and forth.

What I Tried

library(tidyverse)

df %>%
  pivot_longer(-c(Student, ID)) %>%
  group_by(Student, ID) %>%
  arrange(desc(value), .by_group = TRUE) %>%
  slice_head(n = 2) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  ungroup() %>%
  mutate(Total = rowSums(select(., starts_with("Quiz")), na.rm = TRUE))

I also know that if I wanted to sum all the columns on each row, then I could use rowSums, as I made use of above. However, I am unsure how to do rowSums of just the top 2 values in the 4 quiz columns.

Expected Output

# A tibble: 10 × 7
   Student      ID Quiz2 Quiz3 Quiz1 Quiz4 Total
   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Aaron     30016    42    36    NA    NA    78
 2 Charlotte 61755    NA    NA    41    39    80
 3 Chloe     98432    49    NA    42    NA    91
 4 Grant     34096    48    44    NA    NA    92
 5 James     87311    NA    36    NA    43    79
 6 Judy      19487    NA    42    NA    41    83
 7 Justin    94029    NA    37    NA    41    78
 8 Katie     55323    NA    32    NA    46    78
 9 Olivia    94839    NA    43    NA    40    83
10 Timothy   38209    38    38    NA    NA    76

Solution

  • You do not have to do pivot_wider. Note that the longer format is the tidy format. Just do pivot_longer and left_join:

    df %>% 
      left_join(pivot_longer(., -c(Student, ID)) %>%
      group_by(Student, ID) %>%
      summarise(Total = sum(sort(value, TRUE)[1:2]), .groups = 'drop'))
    
    # A tibble: 10 x 7
       Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Total
       <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
     1 Aaron     30016    31    42    36    36    78
     2 James     87311    25    33    36    43    79
     3 Charlotte 61755    41    34    34    39    80
     4 Katie     55323    10    22    32    46    78
     5 Olivia    94839    35    23    43    40    83
     6 Timothy   38209    19    38    38    38    76
     7 Grant     34096    27    48    44    43    92
     8 Chloe     98432    42    49    42    35    91
     9 Judy      19487    15    23    42    41    83
    10 Justin    94029    20    30    37    41    78