Search code examples
rdataframedplyrdata-manipulation

Use dplyr to compute "streaks" in column of data


I don't think the title clearly explain the data calculation I need to do, so I have created a simple reproducible example below:

Here is the input dataframe

structure(list(homePoints = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 2, 2, 4, 4, 4, 4, 4, 4, 5), awayPoints = c(0, 
0, 0, 0, 0, 1, 2, 2, 2, 2, 5, 5, 8, 8, 8, 10, 10, 10), homeMargin = c(0, 
0, 0, 0, 0, -1, -2, -2, -2, 0, -3, -1, -4, -4, -4, -6, -6, -5
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-18L))

# A tibble: 18 x 3
   homePoints awayPoints homeMargin
        <dbl>      <dbl>      <dbl>
 1          0          0          0
 2          0          0          0
 3          0          0          0
 4          0          0          0
 5          0          0          0
 6          0          1         -1
 7          0          2         -2
 8          0          2         -2
 9          0          2         -2
10          2          2          0
11          2          5         -3
12          4          5         -1
13          4          8         -4
14          4          8         -4
15          4          8         -4
16          4         10         -6
17          4         10         -6
18          5         10         -5

And I'm simply attempting to get the following output of "home streaks":

c(-2, 2, -3, 2, -5, 1)

Output explained - The away team starts the game by scoring 2 points (separately 1 point at a time over rows 6 and 7) -2, then the home team scores 2 points (row 10) 2, then the away team scores 3 points (row 11) -3, then the home team scores 2 points (row 12) 2, then the away team scores 5 points (rows 13, 16) -5, then the home team scores 1 point (row 18), 1. Positives for home streaks, negatives for away streaks.

Preferably these streaks would be created as a separate column onto the dataframe called streaks or something like that, and the dataframe would be filtered (in this instance from 18 to 6 rows), with no real preference as to which rows are lost.

Edit: a particular challenge is that I cannot simply look for changes in the homeMargin column, as a single change in the margin doesn't equal a streak. Rather, streaks involve a continuous change in the margin in the same direction.

Edit 2: my efforts are something along these lines so far:

my_data %>%
    dplyr::mutate(streakDirection = c(0, diff(zoo::as.zoo(homeMargin), na.pad = F))) %>%
    dplyr::mutate(signChange = c(0, diff(sign(streakDirection))))

However, this doesn't really get me to where I need to go.


Solution

  • Here's one approach:

    points %>%
      mutate(change_net = homeMargin - lag(homeMargin, default = 0),
             direction  = sign(change_net)) %>%
      filter(direction != 0) %>%
      mutate(streak = cumsum(direction != lag(direction, default = 0))) %>%
      count(streak, wt = change_net)
    
    # A tibble: 6 x 2
      streak     n
       <int> <dbl>
    1      1    -2
    2      2     2
    3      3    -3
    4      4     2
    5      5    -5
    6      6     1