Search code examples
rdatedplyrsummarize

r min max dates by id and multiple status changes within ID


I have an animal tracking dataset which is as shown below

 Id         Start       Stop          Status
 78122      10/12/1919  10/12/1919    Birth
 78122      1/18/1966   2/2/1972      In
 78122      2/3/1972    9/8/1972      In
 78122      9/9/1972    1/23/1974     In
 78122      1/24/1974   10/22/1975    Out
 78122      10/23/1975  5/4/1979      Out
 78122      5/5/1979    8/29/1980     Out
 78122      8/30/1980   5/14/1988     Out
 78122      5/15/1988   6/18/1988     In
 78122      6/19/1988   1/12/1989     In
 78122      1/13/1989   2/23/1990     In
 78122      2/24/1990   6/15/1991     Out
 78122      6/16/1991   2/11/1993     Out
 78122      2/12/1993   5/3/1994      Out
 78122      5/4/1994    7/27/1994     In
 78122      7/22/1994   1/25/1996     Out
 78122      1/26/1996   11/13/2001    In
 78122      11/14/2001  11/19/2001    In
 78122      11/20/2001  9/1/2009      In
 78122      9/26/2009   9/26/2009     Death

This animal was born in 1919 but moved in and out of its native territory multiple times. What I want to create is a dataset like this. I like to summarize the min(Start) and max(Stop) dates by Status.

For example: there are three rows indicating the animal was in the territory between 1/18/1966 to 1/23/1974.

 Id         Start       Stop          Status
 78122      1/18/1966   2/2/1972      In
 78122      2/3/1972    9/8/1972      In
 78122      9/9/1972    1/23/1974     In

This information should be summarized into 1 single row with min(Start) and max(Stop) like this

 Id         MinStart    MaxStop       Status
 78122      1/18/1966   1/23/1974     In

Again there are four rows indicating the animal was out of territory between 1/24/1974 to 5/14/1988.

 Id         Start       Stop          Status
 78122      1/24/1974   10/22/1975    Out
 78122      10/23/1975  5/4/1979      Out
 78122      5/5/1979    8/29/1980     Out
 78122      8/30/1980   5/14/1988     Out

This information should be summarized into 1 single row with min(Start) and max(Stop) like this

 Id         MinStart    MaxStop       Status
 78122      1/24/1974   5/14/1988     Out

Similarly for other In and Out status. The final dataset should look like this below.

 Id         MinStart    MaxStop       Status
 78122      10/12/1919  10/12/1919    Birth
 78122      1/18/1966   1/23/1974     In
 78122      1/24/1974   5/14/1988     Out
 78122      5/15/1988   2/23/1990     In
 78122      2/24/1990   5/3/1994      Out
 78122      5/4/1994    7/27/1994     In
 78122      7/28/1994   1/25/1996     Out
 78122      1/26/1996   9/1/2009      In
 78122      9/26/2009   9/26/2009     Death

Any suggestions on how to rearrange this dataset based on the criteria above is much apricated. So far I tried

 test1 <- testcase %>% 
          group_by(ID,Status) %>% 
          summarize(MinStart  = min(Start), MaxStop= max(Stop))

but this doesn't seem to work. It just creates one min and stop date for all the In Status and Out Status together. That is incorrect.


Solution

  • You need some run length encoding. I'm going to use data.table::rleid for convenience, but you can use the base version if you want:

    library(data.table)
    testcase %>% 
      group_by(Id, RLE = rleid(Status)) %>%
      arrange(Start) %>%
      dplyr::summarise(Start = min(Start), Stop = max(Stop), Status = first(Status))
    # A tibble: 9 x 5
    # Groups:   Id [1]
         Id   RLE Start      Stop       Status
      <int> <int> <date>     <date>     <chr> 
    1 78122     1 1919-10-12 1919-10-12 Birth 
    2 78122     2 1966-01-18 1974-01-23 In    
    3 78122     3 1974-01-24 1988-05-14 Out   
    4 78122     4 1988-05-15 1990-02-23 In    
    5 78122     5 1990-02-24 1994-05-03 Out   
    6 78122     6 1994-05-04 1994-07-27 In    
    7 78122     7 1994-07-22 1996-01-25 Out   
    8 78122     8 1996-01-26 2009-09-01 In    
    9 78122     9 2009-09-26 2009-09-26 Death 
    

    Note that I converted your dates into class date, which I will leave to you. Otherwise they don't sort correctly.

    And here is the group_by call without data.table

    ...
      group_by(Id, RLE = with(rle(Status), rep(seq_along(lengths), lengths))) %>%
    ...