Search code examples
rrun-length-encoding

using rle() on a large dataframe


I'm trying to find the number of consecutive runs an ID has of any score, ordered by date. I've come across a simple method using apply() that seems to work, however, that just gives me the length of each run, not the score or ID associated with it.

The dataframe consists of the following:

DATE         ID     SCORE
2021-04-05   12690   67
2021-04-05   12278   47
2021-04-05   12153   64
---
2021-03-26   12690   88
2021-03-26   12278   47
---
2021-03-20   12690   67
+ 120,000 more rows

DATE is in YYYY-MM-DD format

Based off the sample df above, I'd like it to return something along the lines of this:

12690   67    1
12690   88    1
12690   67    1
12278   47    2
12153   64    1

i.e. for the ID 12690, the score 67 appears once, then 88 once, then 67 once, etc

I can make it work if I force it to use only one ID at a time, but this doesn't seem terribly efficient. Whats the best way I can go about this? TIA


Solution

  • Here's an attempt that works on the data provided. If it doesn't work on your real data, post a bit more complex example dataset and we can try again.

    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    dat <- tibble::tribble(
      ~DATE,         ~ID,     ~SCORE,
    "2021-04-05", 12690,   67,
    "2021-04-05",   12278,   47,
    "2021-04-05", 12153,   64,
    "2021-03-26",   12690,   88,
    "2021-03-26", 12278,   47,
    "2021-03-20",   12690,   67) %>% 
      mutate(DATE = lubridate::ymd(DATE))
    
    
    tmp <- dat %>% 
      arrange(ID, DATE) %>% 
      group_by(ID) %>% 
      mutate(eq = SCORE == lead(SCORE), 
             eq = case_when(is.na(eq) ~ FALSE, 
                         TRUE ~ eq), 
             eq = case_when(!eq ~ SCORE == lag(SCORE), 
                            TRUE ~ eq), 
             eq = case_when(is.na(eq) ~ FALSE, 
                            TRUE ~ eq)) 
    
    tmps <- tmp %>% split(tmp$eq)
    tmps[[2]] <- tmps[[2]] %>% 
      group_by(ID, SCORE) %>% 
      summarise(n = sum(eq))
    #> `summarise()` has grouped output by 'ID'. You can override using the `.groups`
    #> argument.
    tmps[[1]] <- tmps[[1]] %>%
      select(ID, SCORE) %>% 
      mutate(n=1)
    
    bind_rows(tmps)
    #> # A tibble: 5 × 3
    #> # Groups:   ID [3]
    #>      ID SCORE     n
    #>   <dbl> <dbl> <dbl>
    #> 1 12153    64     1
    #> 2 12690    67     1
    #> 3 12690    88     1
    #> 4 12690    67     1
    #> 5 12278    47     2
    

    Created on 2022-04-13 by the reprex package (v2.0.1)


    Here's an alternative that may keep different runs of the same number for the same ID separate.

    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    dat <- tibble::tribble(
      ~DATE,         ~ID,     ~SCORE,
      "2021-04-05", 12690,   67,
      "2021-04-05",   12278,   47,
      "2021-04-05", 12153,   64,
      "2021-03-26",   12690,   88,
      "2021-03-26", 12278,   47,
      "2021-03-20",   12690,   67) %>% 
      mutate(DATE = lubridate::ymd(DATE))
    
    tmp <- dat %>% 
      arrange(ID, DATE) %>% 
      group_by(ID) %>% 
      mutate(obs = 1:n()) %>% 
      ungroup %>% 
      mutate(new_grp = as.numeric(SCORE != lag(SCORE) | ID != lag(ID) | obs != lag(obs) + 1), 
             new_grp = ifelse(is.na(new_grp), 1, new_grp), 
             group = cumsum(new_grp)) %>% 
      group_by(ID, SCORE, group) %>% 
      summarise(n = n()) %>% 
      ungroup %>% 
      select(-group)
    #> `summarise()` has grouped output by 'ID', 'SCORE'. You can override using the
    #> `.groups` argument.
    
    tmp
    #> # A tibble: 5 × 3
    #>      ID SCORE     n
    #>   <dbl> <dbl> <int>
    #> 1 12153    64     1
    #> 2 12278    47     2
    #> 3 12690    67     1
    #> 4 12690    67     1
    #> 5 12690    88     1
    

    Created on 2022-04-14 by the reprex package (v2.0.1)