I have a dataframe with with many distinct UniqueIDs, and which are also ordered by dates. Each UniqueID is sorted from oldest date to newest date. We also have a column called steps which is ordered from 1 to 4.
The goal is for each UniqueID is to find the oldest instance of the first Step, then the oldest instance of the second step etc. Some steps may be missing, for instance step 3 is missing for UniqueID = "B". In this case we skip over Step 3 and move on to step 4.
Here is the original dataframe.
UniqueID Date Step
1 A 2015-07-03 2
2 A 2015-07-07 3
3 A 2015-07-09 1
4 A 2015-07-14 4
5 A 2015-07-17 1
6 A 2015-07-20 2
7 A 2015-07-23 2
8 A 2015-07-24 3
9 A 2015-07-29 3
10 B 2015-06-01 3
11 B 2015-06-15 2
12 B 2015-06-22 1
13 B 2015-06-29 4
14 B 2015-07-13 2
15 B 2015-06-22 2
16 B 2015-07-08 2
17 B 2015-07-27 4
The valid entries we want to select are observations 3, 6, 8, 12, 14, 17. Creating this dataframe:
UniqueID Date Step
3 A 2015-07-09 1
6 A 2015-07-20 2
8 A 2015-07-24 3
12 B 2015-06-22 1
14 B 2015-07-13 2
17 B 2015-07-27 4
I have the logic and some pseudo code but can't put it together. So in the example data frame for UniqueID = "A" we would first group the dataframe:
group_by(UniqueID)
The find the lowest value for UniqueID = "A" and assign to a variable.
v <- min(Step)
returns 1
Then take the index for this step
i <- which.min(Step)
returns 3
We then want to find the min step that is greater than the first step, and only search the elements that occur after the first step. So now we are only searching for values of Step which are > 1, and only from the position of the first value we found onward, in this case from observation 3. We want to keep repeating this for all observations of each UniqueID until we either reach the last observation, or can no longer find an observation that is greater than the last observation in the remaining elements.
Here is the dput for creating the example dataframe:
structure(list(UniqueID = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), Date = structure(c(16619, 16623, 16625,
16630, 16633, 16636, 16639, 16640, 16645, 16587, 16601, 16608,
16615, 16629, 16608, 16624, 16643), class = "Date"), Step = c(2,
3, 1, 4, 1, 2, 2, 3, 3, 3, 2, 1, 4, 2, 2, 2, 4)), .Names = c("UniqueID",
"Date", "Step"), row.names = c(NA, -17L), class = "data.frame")
Alternative dput which crashes using jeremycg's method.
structure(list(UniqueID = structure(c(1L, 1L, 1L, 1L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 8L, 9L, 9L, 10L, 11L), .Label = c("A","B",
"C","D","E","F","G","H","I","J","K"),
class = "factor"), Date = c("3/08/2015",
"21/07/2015", "7/07/2015", "7/07/2015", "29/07/2015", "29/07/2015",
"29/06/2015", "13/07/2015", "9/07/2015", "29/07/2015", "24/07/2015",
"2/07/2015", "16/07/2015", "18/06/2015", "8/07/2015", "29/07/2015",
"12/06/2015", "27/07/2015"), Step = c(1, 1, 4, 4, 4, 3,
5, 5, 1, 4, 1, 2, 2, 2, 3, 3, 2, 2)), .Names = c("UniqueID",
"Date", "Step"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -18L))
Edit: dput of UniqueID that continues to crash even using updated code from jeremycg :
structure(list(UniqueID = structure(c(1L, 1L, 1L, 1L, 1L, 1L ), .Label = c("A" ), class = "factor"), Date = structure(c(16619, 16623, 16625, 16630, 16633, 16636), class = "Date"), Step = c(1, 5, 5, 1, 1, 1)), .Names = c("UniqueID", "Date", "Step"), row.names = c(NA, -6L), class = "data.frame")
Pretty inefficient, but working.
First define a function:
myseq <- function(df){
if(which.min(df$Step) == nrow(df)){
return(list(df[nrow(df),]))
}
store <- vector(mode = "list", length = nrow(df))
i=1
while(any(!is.na(df$Step))){
store[[i]] <- df[which.min(df$Step),]
df <- df[which.min(df$Step) : nrow(df), ]
df$Step[df$Step == min(df$Step)] <- NA
i = i+1
}
store
}
Then wrap it on the dataframe using dplyr
:
library(dplyr)
dta %>% group_by(UniqueID) %>%
do(do.call(rbind, myseq(.)))
Source: local data frame [6 x 3]
Groups: UniqueID
UniqueID Date Step
1 A 2015-07-09 1
2 A 2015-07-20 2
3 A 2015-07-24 3
4 B 2015-06-22 1
5 B 2015-07-13 2
6 B 2015-07-27 4