Search code examples
rdataframedatelubridatedifftime

Time difference calculated from wide data with missing rows


There is a longitudinal data set in the wide format, from which I want to compute time (in years and days) between the first observation date and the last date an individual was observed. Dates are in the format yyyy-mm-dd. The data set has four observation periods with missing dates, an example is as follows

df1<-data.frame("id"=c(1:4),
           "adate"=c("2011-06-18","2011-06-18","2011-04-09","2011-05-20"),
           "bdate"=c("2012-06-15","2012-06-15",NA,"2012-05-23"),
           "cdate"=c("2013-06-18","2013-06-18","2013-04-09",NA),
           "ddate"=c("2014-06-15",NA,"2014-04-11",NA))

Here "adate" is the first date and the last date is the date an individual was last seen. To compute the time difference (lastdate-adate), I have tried using "lubridate" package, for example

lubridate::time_length(difftime(as.Date("2012-05-23"), as.Date("2011-05-20")),"years") 

However, I'm challenged by the fact that the last date is not coming from one column. I'm looking for a way to automate the calculation in R. The expected output would look like

  id years days
1  1  2.99 1093
2  2  2.00  731
3  3  3.01 1098
4  4  1.01  369 

Years is approximated to 2 decimal places.


Solution

  • Another tidyverse solution can be done by converting the data to long format, removing NA dates, and getting the time difference between last and first date for each id.

    library(dplyr)
    library(tidyr)
    library(lubridate)
    
    df1 %>% 
      pivot_longer(-id) %>% 
      na.omit %>% 
      group_by(id) %>% 
      mutate(value = as.Date(value)) %>% 
      summarise(years = time_length(difftime(last(value), first(value)),"years"),
                days = as.numeric(difftime(last(value), first(value))))
    
    #> # A tibble: 4 x 3
    #>      id years  days
    #>   <int> <dbl> <dbl>
    #> 1     1  2.99  1093
    #> 2     2  2.00   731
    #> 3     3  3.01  1098
    #> 4     4  1.01   369