I have a dataframe that contains the following columns:
I need to calculate the per game sum of the home_plusminus and away_plusminus for each home_lineup and each away_lineup.
The data looks like this:
game_id home_lineup awaylineup home_Plusminus Away_Plusminus home_team away_team
12345 L1 L2 -2 2 BOS ATL
12345 L3 L4 3 -3 BOS ATL
12345 L3 L4 3 -3 BOS ATL
45678 L2 L1 3 -3 ATL BOS
45678 L2 L7 1 -1 ATL BOS
45678 L8 L1 3 -3 ATL BOS
The above data shows 2 games played.
I want the final output to look like this:
Team Lineup PlusMinus Pergame
BOS L1 -8 -4.0
BOS L3 6 6.0
BOS L7 -1 -1.0
ATL L2 6 3.0
ATL L4 -6 -6.0
ATL L8 3 3.0
So in the above example, L1 played in two games with a total plusMinus of -8. L3 only played in 1 game.
Here's an approach with tidyr
and dplyr
.
library(tidyr); library(dplyr)
# Step 1 - make into tidy data frame with one row per observation
home <- df %>% select(game_id, contains("home")) %>%
rename("Lineup" = "home_lineup", "Team" = "home_team", "plusminus" = "home_Plusminus")
away <- df %>% select(game_id, contains("away")) %>%
rename("Lineup" = "awaylineup", "Team" = "away_team", "plusminus" = "Away_Plusminus")
tidy <- bind_rows(home, away, .id = "location")
# Step 2 - summarize
output <- tidy %>%
group_by(Team, Lineup) %>%
summarize(PlusMinus = sum(plusminus),
PerGame = PlusMinus/n_distinct(game_id)) %>% ungroup()
Output:
> output
# A tibble: 6 x 4
Team Lineup PlusMinus PerGame
<chr> <chr> <int> <dbl>
1 ATL L2 6 3
2 ATL L4 -6 -6
3 ATL L8 3 3
4 BOS L1 -8 -4
5 BOS L3 6 6
6 BOS L7 -1 -1
Sample data:
df <- read.table(header = T, stringsAsFactors = F, text = "
game_id home_lineup awaylineup home_Plusminus Away_Plusminus home_team away_team
12345 L1 L2 -2 2 BOS ATL
12345 L3 L4 3 -3 BOS ATL
12345 L3 L4 3 -3 BOS ATL
45678 L2 L1 3 -3 ATL BOS
45678 L2 L7 1 -1 ATL BOS
45678 L8 L1 3 -3 ATL BOS")