r dplyr time-series

Check logic of survival timeseries


I have a data frame containing survival data (1 = alive, 0 = dead) for many unique individuals over a long time period. Each individual (i.e. each row) has a series of 0s and 1s, where every column represents a survival check within the time period. I would now like to check the logic of each individual's time series: it is logical if an individual stayed alive the whole time, or if it died along the way and stayed dead. It is illogical if an individual was dead but is alive again at a later timepoint, i.e. it revived. I would like to add a new column to my data frame showing whether an individual's time series is "ok" or "not ok".

I could use dplyr::case_when() and specify all possibilities, but as my time series is quite long this is not really practicable.

Is there a neat way (ideally using dplyr, but everything is great) to test the logic of such a timeseries?

Desired output:

# A tibble: 6 × 7
  ind_ID timeperiod_1 timeperiod_2 timeperiod_3 timeperiod_4 timeperiod_5 logic_status
  <chr>  <fct>        <fct>        <fct>        <fct>        <fct>        <chr>       
1 ID_1   1            1            1            1            1            ok          
2 ID_2   1            0            1            0            0            not ok      
3 ID_3   1            1            1            1            1            ok          
4 ID_4   1            1            1            0            0            ok          
5 ID_5   1            0            1            0            0            not ok      
6 ID_6   1            0            1            0            0            not ok    

example data:

dput(dat)
structure(list(ind_ID = c("ID_1", "ID_2", "ID_3", "ID_4", "ID_5", 
"ID_6"), timeperiod_1 = structure(c(2L, 2L, 2L, 2L, 2L, 2L), levels = c("0", 
"1"), class = "factor"), timeperiod_2 = structure(c(2L, 1L, 2L, 
2L, 1L, 1L), levels = c("0", "1"), class = "factor"), timeperiod_3 = structure(c(2L, 
2L, 2L, 2L, 2L, 2L), levels = c("0", "1"), class = "factor"), 
    timeperiod_4 = structure(c(2L, 1L, 2L, 1L, 1L, 1L), levels = c("0", 
    "1"), class = "factor"), timeperiod_5 = structure(c(2L, 1L, 
    2L, 1L, 1L, 1L), levels = c("0", "1"), class = "factor")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Solution

  • This is a good use case for c_across, which, when combined with rowwise, allows you to treat multiple cells within each row as a single vector.

    Since your alive / dead columns are binary, they can be switched from factors to simple numbers. If we then diff each row, we get a vector of values indicating the transition between consecutive columns: a -1 if someone goes from alive to dead, a +1 if they go from dead to alive, and a 0 if they stay alive or stay dead.
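
    To make the transition idea concrete, here is a minimal sketch on a single made-up series (the vector below is hypothetical and only mirrors the pattern of ID_2 in the example data):

    ```r
    # Hypothetical series for one individual: alive, dead, alive, dead, dead
    status <- factor(c("1", "0", "1", "0", "0"), levels = c("0", "1"))

    # Go through as.character() so we get the values 0/1 rather than level codes 1/2
    x <- as.numeric(as.character(status))

    steps <- diff(x)           # -1  1 -1  0
    revived <- any(steps > 0)  # TRUE, because of the 0 -> 1 step
    ```

    A positive entry in steps marks a dead-to-alive transition, which is the only pattern that makes a series illogical.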

    The only illogical rows are those in which one or more +1 values appear, so any(diff(c_across(-1)) > 0) will return TRUE if the row is illogical and FALSE otherwise. For completeness we can put this in an ifelse to produce the desired 'ok' / 'not ok' output.

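    library(tidyverse)
    
    rowwise(dat) %>%
      # convert the factor columns to their numeric values (0 / 1), not level codes
      mutate(across(-1, ~as.numeric(as.character(.x)))) %>%
      # a positive difference is a 0 -> 1 transition, i.e. a "revival"
      mutate(logical = ifelse(any(diff(c_across(-1)) > 0), 'not ok', 'ok')) %>%
      ungroup()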
    #> # A tibble: 6 x 7
    #>   ind_ID timeperiod_1 timeperiod_2 timeperiod_3 timeperiod_4 timeperiod_5 logical
    #>   <chr>         <dbl>        <dbl>        <dbl>        <dbl>        <dbl> <chr>  
    #> 1 ID_1              1            1            1            1            1 ok     
    #> 2 ID_2              1            0            1            0            0 not ok 
    #> 3 ID_3              1            1            1            1            1 ok     
    #> 4 ID_4              1            1            1            0            0 ok 
    #> 5 ID_5              1            0            1            0            0 not ok 
    #> 6 ID_6              1            0            1            0            0 not ok
    

    We can see this has correctly identified ID_2, ID_5 and ID_6 as 'not ok' because in each case they were dead at timepoint 2 but alive at timepoint 3. ID_1 and ID_3 are correctly labeled as 'ok' because they were alive throughout, and ID_4 is 'ok' because it died without resurrection.
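
    If the data frame has many rows, the same check can also be done without rowwise. The following is a sketch of a base R alternative, not the approach used above: it rebuilds the example data as a plain data.frame (an assumption made so the block is self-contained) and compares neighbouring columns for all rows at once. The names status_cols, m and steps are only illustrative.

    ```r
    # Example data rebuilt by hand from the question's dput() output
    dat <- data.frame(
      ind_ID       = paste0("ID_", 1:6),
      timeperiod_1 = factor(c(1, 1, 1, 1, 1, 1), levels = c(0, 1)),
      timeperiod_2 = factor(c(1, 0, 1, 1, 0, 0), levels = c(0, 1)),
      timeperiod_3 = factor(c(1, 1, 1, 1, 1, 1), levels = c(0, 1)),
      timeperiod_4 = factor(c(1, 0, 1, 0, 0, 0), levels = c(0, 1)),
      timeperiod_5 = factor(c(1, 0, 1, 0, 0, 0), levels = c(0, 1))
    )

    # All columns except the ID are survival checks, in chronological order
    status_cols <- setdiff(names(dat), "ind_ID")

    # Integer matrix of 0/1 values: one row per individual, one column per check
    m <- sapply(dat[status_cols], function(col) as.integer(as.character(col)))

    # Difference between each check and the previous one, for every row at once
    steps <- m[, -1, drop = FALSE] - m[, -ncol(m), drop = FALSE]

    # Any positive step in a row means the individual "revived"
    dat$logic_status <- ifelse(rowSums(steps > 0) > 0, "not ok", "ok")
    ```

    This gives the same labels as the rowwise version for the example data. It relies on the columns already being in chronological order.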