Search code examples
rdataframeobject-slicing

Average over time series data with custom time slice window


I am working with some tennis ranking data in R, that gives the evolution of tennis rankings over time of all the players on the ATP tour.

An example of the data I am using can be found here, giving the rankings data from 2000's: https://github.com/JeffSackmann/tennis_atp/blob/master/atp_rankings_00s.csv

To clean up the data:

rankings <- read_csv("data/atp/atp_rankings_00s.csv")
rankings = rankings %>% 
  mutate(rankingDate = lubridate::ymd(ranking_date) ) %>% 
  select(-ranking_date)

Now, suppose I wish to trace the time evolution of each player over the entire decade, and calculate their mean ranking during this period. Then I can write:

rankings %>%  
  group_by(player) %>% 
  summarise(
    meanRanking = mean(rank, na.rm = TRUE),
  ) 

However, suppose I want something more. I want to slice up this data along the time axis, and calculate the mean ranking for these slices. Thus, with something like start=01-01-2000, end=01-01-2008, skip=2 years, I could have mean rankings over 2-year time windows for the period from 1 Jan 2000 to 1 Jan 2008. How would one code such a 'time slicing` in R?


Solution

  • Your data are reasonably large, and data.table can help quite a bit with speed. Here is an approach that is very fast, and it uses a flexible function f(s,e,p,u), which allows you to pass in any start (s) or end (e) date, an integer period (e.g. 2 for 2 years), and time unit (u) which takes values "y", "m", or "d", for years, months, days, respectively

    f <- function(s,e,p, u=c("y","m","d")) {
      u=match.arg(u)
      uf = list("y"=years,"m"=months,"d"=days)
      data.table(s = seq(as.Date(s), as.Date(e),by=paste(p,u)))[,`:=`(e=s %m+% uf[[u]](p), period=1:.N)]
    }
    

    Then you just apply the function to rankings in a non-equi join

    setDT(rankings)
    rankings[f("2000-01-01", "2008-01-01",2), on=.(ranking_date>=s, ranking_date<=e)] %>% 
      .[,.(ranking=mean(rank,na.rm=T)), by=.(player,period )]
    

    Output:

           player period     ranking
        1: 101736      1    2.769231
        2: 102338      1    5.211538
        3: 101948      1    4.730769
        4: 103017      1   23.259615
        5: 102856      1    2.538462
       ---                          
    13543: 105996      5 1780.500000
    13544: 105050      5 1665.333333
    13545: 105757      5 1781.000000
    13546: 121555      5 1328.500000
    13547: 106018      5 1508.000000
    

    For 3 month windows, just call f with p=3, and u="m"

    rankings[f("2000-01-01", "2008-01-01",3, "m"), on=.(ranking_date>=s, ranking_date<=e)] %>% 
      .[,.(ranking=mean(rank,na.rm=T)), by=.(player,period )]
    

    Output (for 3 month windows)

           player period     ranking
        1: 101736      1    1.000000
        2: 102338      1    2.666667
        3: 101948      1    2.333333
        4: 103017      1    4.000000
        5: 102856      1    5.500000
       ---                          
    62491: 105517     33 1502.000000
    62492: 104405     33 1588.000000
    62493: 104576     33 1588.000000
    62494: 105500     33 1679.000000
    62495: 108698     33 1844.000000