Search code examples
rdataframedplyrgrouping

Grouping sports data in a dataframe and aggregating it into a new table in R


Let's say I have a dataframe df like this:

       Home      Away HomeGoals AwayGoals HomePoints AwayPoints
1   Arsenal   Chelsea         3         1          3          0
2   Chelsea Liverpool         0         0          1          1
3 Liverpool   Arsenal         2         1          3          0
4   Chelsea   Arsenal         4         5          0          3
5 Liverpool   Chelsea         3         2          3          0
6   Arsenal Liverpool         2         2          1          1

I would like to produce a new dataframe FinalStandings showing team standings, sorted by Total Points, Goal Difference and Goals Scored, in descending order. So I would need to introduce new columns Team, TotalPoints , GoalDiff , and GoalsScored. So for Arsenal, the corresponding row of FinalStandings should look like this:

     Team TotalPoints GoalDiff GoalsScored
 Arsenal           7        2          11

I suppose I need to use the library(dplyr) (or maybe some other way) along with some sort of aggregating group_by() type functions, but I am struggling to implement it and get it to work. Please show how can one achieve the desired result efficiently?

Edit: note that each team appears in both "Home" and "Away" columns in df, so one cannot just simply group by Home or Awaycolumn.

One way I can think of is to do group_by on Home column and perform some summary operation, then do the same onAway, then usecbind to combine together. But seems I need to do the whole operation several times. Is there a faster way?


Solution

  • Another solution which is kind of similar in the single steps to the one from @Jon Spring but uses a different selection strategy to achieve the same result.

    Step 1 - create intermediate helper columns GoalDiff. The columns in df are alternating for Home and Away so we keep this format for the new columns.

    df_temp <- df %>%
      mutate(HomeGoalDiff = HomeGoals - AwayGoals,
             AwayGoalDiff = - HomeGoalDiff)
    

    Step 2 - omit the Home and Away prefixes as we know they are alternating and we select the columns by index.

    colnames(df_temp) <- rep(c("Team","Goal","Points","Diff"), each=2)
    

    Step 3 - Select the Home and Away columns to create another dataframe that can be used to aggregate the metrics per Team.

    FinalStandings <- bind_rows(
        df_temp %>% select(seq(1,7,2)), # home cols 1,3,5,7
        df_temp %>% select(seq(2,8,2))  # away cols 2,4,6,8
      ) %>%
      group_by(Team) %>%
      summarise(TotalPoints = sum(Points),
                GoalDiff = sum(Diff),
                GoalsScored = sum(Goal)) %>%
      arrange(desc(TotalPoints), desc(GoalDiff), desc(GoalsScored))
    

    Step 4 - print the result

    FinalStandings
    
    #  Team      TotalPoints GoalDiff GoalsScored
    #   <chr>           <dbl>    <dbl>       <dbl>
    # 1 Liverpool           8        2           7
    # 2 Arsenal             7        2          11
    # 3 Chelsea             1       -4           7