Search code examples
rsummarization

R function to summarize multiple columns of data with multiple functions, grouped by a column


I have a dataframe that contains the following columns:

  • game_id - chr, 1 ID for each game, multiple rows per game
  • home_lineup - chr
  • away_lineup - chr
  • home_plusminus - int
  • away_plusminus - int
  • home_team - chr
  • away_team - chr

I need to calculate the per game sum of the home_plusminus and away_plusminus for each home_lineup and each away_lineup.

The data looks like this:

game_id home_lineup awaylineup home_Plusminus Away_Plusminus home_team  away_team
12345   L1          L2          -2              2            BOS         ATL
12345   L3          L4           3             -3            BOS         ATL
12345   L3          L4           3             -3            BOS         ATL
45678   L2          L1           3             -3            ATL         BOS
45678   L2          L7           1             -1            ATL         BOS
45678   L8          L1           3             -3            ATL         BOS

The above data shows 2 games played.

I want the final output to look like this:

Team Lineup PlusMinus Pergame
BOS  L1     -8        -4.0
BOS  L3      6         6.0
BOS  L7     -1        -1.0
ATL  L2      6         3.0
ATL  L4     -6        -6.0
ATL  L8      3         3.0

So in the above example, L1 played in two games with a total plusMinus of -8. L3 only played in 1 game.


Solution

  • Here's an approach with tidyr and dplyr.

    library(tidyr); library(dplyr)
    
    # Step 1 - make into tidy data frame with one row per observation
    home <- df %>% select(game_id, contains("home")) %>% 
      rename("Lineup" = "home_lineup", "Team" = "home_team", "plusminus" = "home_Plusminus")
    
    away <- df %>% select(game_id, contains("away")) %>% 
      rename("Lineup" = "awaylineup", "Team" = "away_team", "plusminus" = "Away_Plusminus")
    
    tidy <- bind_rows(home, away, .id = "location")
    
    
    
    # Step 2 - summarize
    output <- tidy %>%
      group_by(Team, Lineup) %>%
      summarize(PlusMinus = sum(plusminus),
                PerGame = PlusMinus/n_distinct(game_id)) %>% ungroup()
    

    Output:

    > output
    # A tibble: 6 x 4
      Team  Lineup PlusMinus PerGame
      <chr> <chr>      <int>   <dbl>
    1 ATL   L2             6       3
    2 ATL   L4            -6      -6
    3 ATL   L8             3       3
    4 BOS   L1            -8      -4
    5 BOS   L3             6       6
    6 BOS   L7            -1      -1
    

    Sample data:

    df <- read.table(header = T, stringsAsFactors = F, text = "
                     game_id home_lineup awaylineup  home_Plusminus  Away_Plusminus  home_team   away_team
     12345  L1          L2          -2              2               BOS       ATL
         12345  L3          L4           3             -3               BOS       ATL
         12345  L3          L4           3             -3               BOS       ATL
         45678  L2          L1           3             -3               ATL       BOS
         45678  L2          L7           1             -1               ATL       BOS
         45678  L8          L1           3             -3               ATL       BOS")