I have an animal tracking dataset which is as shown below
Id Start Stop Status
78122 10/12/1919 10/12/1919 Birth
78122 1/18/1966 2/2/1972 In
78122 2/3/1972 9/8/1972 In
78122 9/9/1972 1/23/1974 In
78122 1/24/1974 10/22/1975 Out
78122 10/23/1975 5/4/1979 Out
78122 5/5/1979 8/29/1980 Out
78122 8/30/1980 5/14/1988 Out
78122 5/15/1988 6/18/1988 In
78122 6/19/1988 1/12/1989 In
78122 1/13/1989 2/23/1990 In
78122 2/24/1990 6/15/1991 Out
78122 6/16/1991 2/11/1993 Out
78122 2/12/1993 5/3/1994 Out
78122 5/4/1994 7/27/1994 In
78122 7/22/1994 1/25/1996 Out
78122 1/26/1996 11/13/2001 In
78122 11/14/2001 11/19/2001 In
78122 11/20/2001 9/1/2009 In
78122 9/26/2009 9/26/2009 Death
This animal was born in 1919 but moved in and out of its native territory multiple times. What I want to create is a dataset like this. I like to summarize the min(Start)
and max(Stop)
dates by Status.
For example: there are three rows indicating the animal was in the territory between 1/18/1966
to 1/23/1974
.
Id Start Stop Status
78122 1/18/1966 2/2/1972 In
78122 2/3/1972 9/8/1972 In
78122 9/9/1972 1/23/1974 In
This information should be summarized into 1 single row with min(Start)
and max(Stop)
like this
Id MinStart MaxStop Status
78122 1/18/1966 1/23/1974 In
Again there are four rows indicating the animal was out of territory between 1/24/1974
to 5/14/1988
.
Id Start Stop Status
78122 1/24/1974 10/22/1975 Out
78122 10/23/1975 5/4/1979 Out
78122 5/5/1979 8/29/1980 Out
78122 8/30/1980 5/14/1988 Out
This information should be summarized into 1 single row with min(Start)
and max(Stop)
like this
Id MinStart MaxStop Status
78122 1/24/1974 5/14/1988 Out
Similarly for other In and Out status. The final dataset should look like this below.
Id MinStart MaxStop Status
78122 10/12/1919 10/12/1919 Birth
78122 1/18/1966 1/23/1974 In
78122 1/24/1974 5/14/1988 Out
78122 5/15/1988 2/23/1990 In
78122 2/24/1990 5/3/1994 Out
78122 5/4/1994 7/27/1994 In
78122 7/28/1994 1/25/1996 Out
78122 1/26/1996 9/1/2009 In
78122 9/26/2009 9/26/2009 Death
Any suggestions on how to rearrange this dataset based on the criteria above is much apricated. So far I tried
test1 <- testcase %>%
group_by(ID,Status) %>%
summarize(MinStart = min(Start), MaxStop= max(Stop))
but this doesn't seem to work. It just creates one min and stop date for all the In Status and Out Status together. That is incorrect.
You need some run length encoding. I'm going to use data.table::rleid
for convenience, but you can use the base version if you want:
library(data.table)
testcase %>%
group_by(Id, RLE = rleid(Status)) %>%
arrange(Start) %>%
dplyr::summarise(Start = min(Start), Stop = max(Stop), Status = first(Status))
# A tibble: 9 x 5
# Groups: Id [1]
Id RLE Start Stop Status
<int> <int> <date> <date> <chr>
1 78122 1 1919-10-12 1919-10-12 Birth
2 78122 2 1966-01-18 1974-01-23 In
3 78122 3 1974-01-24 1988-05-14 Out
4 78122 4 1988-05-15 1990-02-23 In
5 78122 5 1990-02-24 1994-05-03 Out
6 78122 6 1994-05-04 1994-07-27 In
7 78122 7 1994-07-22 1996-01-25 Out
8 78122 8 1996-01-26 2009-09-01 In
9 78122 9 2009-09-26 2009-09-26 Death
Note that I converted your dates into class date
, which I will leave to you. Otherwise they don't sort correctly.
And here is the group_by
call without data.table
...
group_by(Id, RLE = with(rle(Status), rep(seq_along(lengths), lengths))) %>%
...