r, survival-analysis, data-cleaning

Data Cleaning for Survival Analysis


I’m cleaning some data for a survival analysis and want each individual to have only a single, sustained transition from symptom present (ss=1) to symptom remitted (ss=0). A remission only counts if it is complete and sustained. Statistical issues aside, I’m wondering how to address the problems detailed below.

I’ve been trying to break the problem into smaller, more manageable operations and objects. However, every solution I come up with forces me to use conditional logic based on the rows immediately above and below a missing value and, quite frankly, I’m at a loss as to how to do this. I would love some guidance if you know of a good technique I can experiment with, or of any good search terms I can use when looking up a solution.

The details are below:

#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <- c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,1,1,1,NA,0,0,1,1,0,NA,0,0,0,1,1,1,1,1,1,NA,1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

*Bold and underlined characters represent changes from the dataset above

The goal here is to find a way to get the NA values for ID #1 (variable ss) to look like this: 1,1,1,1,1,0,0

ID #2 (variable ss) to look like this: 1,1,0,0,0,0,0

ID #3 (variable ss) to look like this: 1,1,1,1,1,1,NA (no change because the row with NA will be deleted eventually)

ID #4 (variable ss) to look like this: 1,1,1,1,1,0,0 (this one requires multiple changes and I expect it is the most challenging to tackle).
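
To check any candidate fix, the four targets listed above can be written as one vector aligned with the rows of mydat. This is only a minimal sketch: the names `target` and `changed` are illustrative and not part of the original data.

```r
# Recreate the example data from the question
id   <- rep(1:4, each = 7)
time <- rep(0:6, times = 4)
ss   <- c(1,1,1,1,NA,0,0,  1,1,0,NA,0,0,0,  1,1,1,1,1,1,NA,  1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

# Requested values, seven per id (hypothetical column name)
mydat$target <- c(1,1,1,1,1,0,0,  1,1,0,0,0,0,0,  1,1,1,1,1,1,NA,  1,1,1,1,1,0,0)

# Rows where the target differs from the raw value (NA-safe comparison)
changed <- !mapply(identical, mydat$ss, mydat$target)
mydat[changed, ]
```

Comparing a solution's output against `mydat$target` with `all.equal()` then shows at a glance whether every ID is handled as intended.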


Solution

  • I don't think you have considered all the edge cases. What should happen with two NAs in a row at the end of a period, or with 4 or 5 NAs in a row? Using the na.locf function from the zoo package, the following reproduces the requested result for IDs 1 to 3 of your tiny test case; for ID 4 it gives 1,1,0,0,0,0,0 rather than the requested 1,1,1,1,1,0,0:

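    library(zoo)  # provides na.locf()
    # Carry the last observation forward within one id's series; leave the
    # series untouched when its final value is NA (as for ID 3).
    fillNA <- function(vec) {
      # na.rm = FALSE keeps any leading NAs so the length stays the same for ave()
      if (is.na(tail(vec, 1))) vec else na.locf(vec, na.rm = FALSE)
    }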
    
    > mydat$locf <- with(mydat, ave(ss, id, FUN=fillNA))
    > mydat
       id time ss locf
    1   1    0  1    1
    2   1    1  1    1
    3   1    2  1    1
    4   1    3  1    1
    5   1    4 NA    1
    6   1    5  0    0
    7   1    6  0    0
    8   2    0  1    1
    9   2    1  1    1
    10  2    2  0    0
    11  2    3 NA    0
    12  2    4  0    0
    13  2    5  0    0
    14  2    6  0    0
    15  3    0  1    1
    16  3    1  1    1
    17  3    2  1    1
    18  3    3  1    1
    19  3    4  1    1
    20  3    5  1    1
    21  3    6 NA   NA
    22  4    0  1    1
    23  4    1  1    1
    24  4    2  0    0
    25  4    3 NA    0
    26  4    4 NA    0
    27  4    5  0    0
    28  4    6  0    0
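
The output above leaves ID 4 as 1,1,0,0,0,0,0, whereas the question asks for 1,1,1,1,1,0,0. One rule that would reproduce all four requested series is sketched below in base R. The rule is an assumption inferred from the four examples, not something the question states: a series ending in NA is left alone, a single NA takes the previous value, a run of two or more NAs is treated as symptom present (1), and everything up to the last 1 is then set to 1 so that only one transition remains. The function name `fill_sustained` and the parameter `max_gap` are hypothetical.

```r
# Example data from the question
id   <- rep(1:4, each = 7)
time <- rep(0:6, times = 4)
ss   <- c(1,1,1,1,NA,0,0,  1,1,0,NA,0,0,0,  1,1,1,1,1,1,NA,  1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

# Assumed rule (not confirmed by the question), applied to one id's series
fill_sustained <- function(ss, max_gap = 1) {
  n <- length(ss)
  if (n == 0 || is.na(ss[n])) return(ss)      # trailing NA: leave unchanged

  runs   <- rle(is.na(ss))                    # alternating NA / non-NA runs
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  out    <- ss

  for (k in which(runs$values)) {             # loop over the NA runs only
    if (starts[k] == 1) next                  # leading NAs: nothing to carry forward
    idx <- starts[k]:ends[k]
    # short gap: carry last value forward; long gap: assume symptom present
    out[idx] <- if (runs$lengths[k] <= max_gap) out[starts[k] - 1] else 1
  }

  # Enforce a single transition: every observed value up to the last 1 becomes 1
  ones <- which(out == 1)
  if (length(ones) > 0) {
    before <- seq_len(max(ones))
    out[before[!is.na(out[before])]] <- 1
  }
  out
}

mydat$fixed <- ave(mydat$ss, mydat$id, FUN = fill_sustained)
mydat
```

Whether a gap of two visits should really cancel an earlier remission is a substantive decision; changing `max_gap` moves that cut-off, and longer NA runs at the end of a series are still left untouched here.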