Search code examples
rdataframeaggregategrouping

Grouping a dataframe in R by multiply columns and performing calculations on grouped data


I have a dataframe df in R of sports teams, including a column Season, Date, and Goals for every match, like so:

  Season       Date      Home      Away Goals
1    2013 06/04/2013   Arsenal   Chelsea     3
2    2013 06/04/2013 Tottenham Newcastle     5
3    2013 10/04/2013     Milan   Rangers     4
4    2013 07/05/2013     Real     Bayern     1
5    2013 07/05/2013  Juventus     Inter     2
6    2013 07/06/2013 Barcelona   Chelsea     1
7    2014 03/04/2014     Braga   Benfica     3
8    2014 13/04/2014   Rangers   Chelsea     1
9    2014 08/05/2014    Bayern Tottenham     0
10   2014 10/05/2014 Newcastle     Braga     2

I would like to group this data into months for each season and calculate :

  1. total goals combined over all matches played for each month
  2. average goals scored per match for each month
  3. average goals scored per match for each season
  4. also display additional column stating every month for each season

My aim is to get a new dataframe like this:

  Season month TotalGoalsMonth AvGoalsMonth AvGoalsSeason
1   2013     4              12          4.0          2.67
2   2013     5               3          1.5          2.67
3   2013     6               1          1.0          2.67
4   2014     4               4          2.0          1.50
5   2014     5               2          2.0          1.50

So for season 2013, there were months 4, 5, 6. In months 4 there were 3 matches played yielding a total of 12 goals (TotalGoalsMonth column). The average goals per month is 12/3=4.0 (AvGoalsMonth column). And the same for months 5, 6, then repeat for season 2014, etc.

A total of 6 matches were played in 2013, with a total of 16 goals, so average goals per 2013 season is 16 / 6 = 2.67 (AvGoalsSeason column). Then repeat for season 2014, and so on.

Please explain how to do this in R.


Solution

  • Turn the dates into Date class, extract month from it. Calculate Total goals, Average goals and number of matches for each month.

    library(dplyr)
    library(lubridate)
    
    df %>%
      mutate(Date = dmy(Date), 
             Month = month(Date)) %>%
      summarise(TotalGoalsMonth = sum(Goals), 
                AvGoalsMonth = mean(Goals),
                NumOfMatches = n(),
             .by = c(Season, Month)) %>%
      mutate(AvGoalsSeason = sum(TotalGoalsMonth)/sum(NumOfMatches), .by = Season)
    
    #  Season Month TotalGoalsMonth AvGoalsMonth NumOfMatches AvGoalsSeason
    #1   2013     4              12          4.0            3      2.666667
    #2   2013     5               3          1.5            2      2.666667
    #3   2013     6               1          1.0            1      2.666667
    #4   2014     4               4          2.0            2      1.500000
    #5   2014     5               2          1.0            2      1.500000
    

    If your data is huge, you may use collapse whose syntax is similar to that of dplyr but it is much faster. (Even faster than base R in many cases.) Here's a dplyr-free version.

    library(collapse)
    
    df |>
      fmutate(Date = as.Date(Date, '%d/%m/%Y'), 
             Month = format(Date, '%m')) %>%
      fgroup_by(Season, Month) |>
      fsummarise(TotalGoalsMonth = sum(Goals), 
                AvGoalsMonth = mean(Goals),
                NumOfMatches = GRPN()) |>
      fgroup_by(Season) |>
      fmutate(AvGoalsSeason = sum(TotalGoalsMonth)/sum(NumOfMatches))