There is a longitudinal data set in the wide format, from which I want to compute time (in years and days) between the first observation date and the last date an individual was observed. Dates are in the format yyyy-mm-dd. The data set has four observation periods with missing dates, an example is as follows
df1<-data.frame("id"=c(1:4),
"adate"=c("2011-06-18","2011-06-18","2011-04-09","2011-05-20"),
"bdate"=c("2012-06-15","2012-06-15",NA,"2012-05-23"),
"cdate"=c("2013-06-18","2013-06-18","2013-04-09",NA),
"ddate"=c("2014-06-15",NA,"2014-04-11",NA))
Here "adate" is the first date and the last date is the date an individual was last seen. To compute the time difference (lastdate-adate), I have tried using "lubridate" package, for example
lubridate::time_length(difftime(as.Date("2012-05-23"), as.Date("2011-05-20")),"years")
However, I'm challenged by the fact that the last date is not coming from one column. I'm looking for a way to automate the calculation in R. The expected output would look like
id years days
1 1 2.99 1093
2 2 2.00 731
3 3 3.01 1098
4 4 1.01 369
Years is approximated to 2 decimal places.
Another tidyverse
solution can be done by converting the data to long format, removing NA
dates, and getting the time difference between last and first date for each id
.
library(dplyr)
library(tidyr)
library(lubridate)
df1 %>%
pivot_longer(-id) %>%
na.omit %>%
group_by(id) %>%
mutate(value = as.Date(value)) %>%
summarise(years = time_length(difftime(last(value), first(value)),"years"),
days = as.numeric(difftime(last(value), first(value))))
#> # A tibble: 4 x 3
#> id years days
#> <int> <dbl> <dbl>
#> 1 1 2.99 1093
#> 2 2 2.00 731
#> 3 3 3.01 1098
#> 4 4 1.01 369