I am currently trying to create a new variable to return the earliest date across two variables. I've used the pmin()
function and have had no problem in the instances where there are dates available across both variables. In the instance where there is a missing date on one variable, but a date in the other, ideally I would like the code to return the date in the latter variable. Instead, an NA is returned and I haven't figured out how to solve this yet.
My code is as follows:
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005),
disease_code = c('1', '1' ,'0', '1','0'),
Date_of_diagnosis_1 = c(NA,'13/06/1997',NA,'18/02/2005',NA),
Date_of_diagnosis_2 = c('12/06/1998',NA,NA,'18/09/2001',NA))
df$Date_of_diagnosis_1 <- as.Date(df$Date_of_diagnosis_1, format="%d/%m/%Y")
df$Date_of_diagnosis_2 <- as.Date(df$Date_of_diagnosis_2, format="%d/%m/%Y")
df <- transform(df, earliest_date = pmin(Date_of_diagnosis_1,
Date_of_diagnosis_2, na.rm=TRUE))
UPDATE: na.rm was in the wrong position as highlighted by users Duck and Wimpel. The above code now works as desired.
Here is the data.table
way of things
library( data.table )
#make it a data.table
setDT(df)
#set dates as real date
cols = grep( "^Date", names(df), value = TRUE )
df[, (cols) := lapply( .SD, as.Date, format = "%d/%m/%Y"), .SDcols = cols]
#find minumum date in date-columns
df[, earliest_date := do.call( pmin, c( .SD, list( na.rm = TRUE ) ) ), .SDcols = cols ]
# ID disease_code Date_of_diagnosis_1 Date_of_diagnosis_2 earliest_date
# 1: 1001 1 <NA> 1998-06-12 1998-06-12
# 2: 1002 1 1997-06-13 <NA> 1997-06-13
# 3: 1003 0 <NA> <NA> <NA>
# 4: 1004 1 <NA> 2001-09-18 2001-09-18
# 5: 1005 0 <NA> <NA> <NA>