Search code examples
rdplyrpanelrankpercentile

Rank computation considering time stamp in grouped data


In my game dataset, I have observations for several game players for several points in time. For each observation, I want to compute a rank for this player based on the number of points compared to the number of points of other players at this point in time. Therefore, it has to compare the points of this player of this observation with the number of points of all other players at their last (or time-wise closest in the past or exactly at the same second) observation of each of the others users.

Example data including the expected rank:

da = data.frame(player = c(1,1,1,2,2,2,3,3,3), date_sec = c(1451665633,1451665693,1451665721,1451665627,1451665692,1451665738,1451665626,1451665684,1451665765), points = c(100,150,200,130,140,230,80,90,100), rank = c(2,1,1,1,1,1,1,3,3))

da
  player   date_sec points rank
1      1 1451665633    100    2
2      1 1451665693    150    1
3      1 1451665721    200    1
4      2 1451665627    130    1
5      2 1451665692    140    1
6      2 1451665738    230    1
7      3 1451665626     80    1
8      3 1451665684     90    3
9      3 1451665765    100    3

E.g., player 2 has in date_sec 1451665738 rank 1, because at this point in time, no other player has had more points.

I know how to rank within groups, but I don't find a way to take the exact point in time into account here. A way would be to extract the day out of the timestamp and to group by day and player, but this is not not as accurate as I would like to have, because the rank can change several times within one day.

library(dplyr)

da2 = mutate(da, day = as.Date(as.POSIXct(date_sec, origin="1970-01-01"))) %>%
   group_by(player, day) %>%
   mutate(my_ranks = order(order(points, day, decreasing=TRUE)))

da2

A tibble: 9 x 6
# Groups:   player, day [3]
  player   date_sec points  rank day        my_ranks
   <dbl>      <dbl>  <dbl> <dbl> <date>        <int>
1      1 1451665633    100     2 2016-01-01        3
2      1 1451665693    150     1 2016-01-01        2
3      1 1451665721    200     1 2016-01-01        1
4      2 1451665627    130     1 2016-01-01        3
5      2 1451665692    140     2 2016-01-01        2
6      2 1451665738    230     2 2016-01-01        1
7      3 1451665626     80     3 2016-01-01        3
8      3 1451665684     90     3 2016-01-01        2
9      3 1451665765    100     3 2016-01-01        1

Next to the rank, I would also like to have a percentile rank based on the number of points, because there are different numbers of players active in different time periods.

Does anyone have an idea?


Solution

  • First, I would use complete so that you have a row for each player/date_sec combination. This will allow for easier comparisons for each moment in time.

    Next, I would use fill to carry forward each player's most recent score. The scores should be sorted/arranged before this.

    Then, since you have a score for each time stamp, you can group_by(date_sec) and order the players at each time for ranking.

    Finally, you can join back to original data frame to provide the desired ranks.

    library(tidyverse)
    
    da %>%
      complete(player, date_sec) %>%
      group_by(player) %>%
      arrange(date_sec) %>%
      fill(points) %>%
      group_by(date_sec) %>%
      mutate(my_ranks = order(order(points, decreasing = TRUE))) %>%
      right_join(da)
    

    Output

    Joining, by = c("player", "date_sec", "points", "rank")
    # A tibble: 9 x 5
    # Groups:   date_sec [9]
      player   date_sec points  rank my_ranks
       <dbl>      <dbl>  <dbl> <dbl>    <int>
    1      1 1451665633    100     2        2
    2      1 1451665693    150     1        1
    3      1 1451665721    200     2        1
    4      2 1451665627    130     1        1
    5      2 1451665692    140     2        1
    6      2 1451665738    230     1        1
    7      3 1451665626     80     3        1
    8      3 1451665684     90     3        3
    9      3 1451665765    100     3        3