Following data frame in data.table
df <- data.table (id=c(1,1,2,2,3,3,4,4),
date=c("2013-11-22","2017-01-24","2017-06-24","2020-02-10","2011-01-03","2013-11-24","2015-01-24","2017-08-24"),
status=c("Former","Current","Former","Never","Current",NA,"Current","Former"))
df
id date status
1: 1 2013-11-22 Former
2: 1 2017-01-24 Current
3: 2 2017-06-24 Former
4: 2 2020-02-10 Never
5: 3 2011-01-03 Current
6: 3 2013-11-24 <NA>
7: 4 2015-01-24 Current
8: 4 2017-08-24 Former
I want to create a unique row per id with the following logicals. The latest date
should be kept. If the status
at latest date is <NA>
or Never
and there was another status
for an earlier date, than the row with the earlier date should be kept.
I solved this with the following functions:
unique1 <- df[df$status %in% c("Former","Current"),]
unique1 <- unique1[,.SD[which.max(anydate(date))],by=.(id)]
unique_final <- unique(df[order(id,ordered(status,c("Former","Current","Never",NA)))],by='id')
unique_final[match(unique1$id,unique_final$id),]<-unique1
and get these results
id date status
1: 1 2017-01-24 Current
2: 2 2017-06-24 Former
3: 3 2011-01-03 Current
4: 4 2017-08-24 Former
Is there a way to combine these two logical subsetting steps? I would like to avoid creating a new data frame and than matching them.
I am working with data.table
and a solution for a larger data set would be great.
Thanks!
Could try:
library(data.table)
df[, .SD[
if (all(status %in% c(NA, 'Never'))) .N
else max(which(!status %in% c(NA, 'Never')))
], by = id]
Output:
id date status
1: 1 2017-01-24 Current
2: 2 2017-06-24 Former
3: 3 2011-01-03 Current
4: 4 2017-08-24 Former