I’m cleaning some data for a survival analysis, and I’m trying to make it so that each individual has only a single, sustained transition from symptom present (ss=1) to symptom remitted (ss=0). A remission only counts if it is complete and sustained. Statistical issues aside, I’m wondering how I can address the problems detailed below.
I’ve been trying to break the problem into smaller, more manageable operations and objects. However, the solutions I keep coming to force me to use conditional logic based on the rows immediately above and below a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love some guidance if you know of a good technique I can experiment with, or of any good search terms I can use when looking up a solution.
The details are below:
# Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <- c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,1,1,1,NA,0,0,1,1,0,NA,0,0,0,1,1,1,1,1,1,NA,1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
The goal is to find a way to fill in the NA values in ss so that each ID looks like this:
ID #1 (variable ss): 1,1,1,1,1,0,0
ID #2 (variable ss): 1,1,0,0,0,0,0
ID #3 (variable ss): 1,1,1,1,1,1,NA (no change, because the row with NA will eventually be deleted)
ID #4 (variable ss): 1,1,1,1,1,0,0 (this one requires multiple changes, and I expect it to be the most challenging to tackle)
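To make the target concrete, here is a minimal sketch in base R that writes the desired result as a vector and reports which IDs a candidate solution still gets wrong. The names `target` and `ids_not_matching` are hypothetical, introduced only for illustration, and the target values are copied from the four ID descriptions above.

```r
# Example data from above (same values, built with rep() for brevity)
id <- rep(1:4, each = 7)
ss <- c(1,1,1,1,NA,0,0, 1,1,0,NA,0,0,0, 1,1,1,1,1,1,NA, 1,1,0,NA,NA,0,0)

# Desired result, copied from the ID #1 to ID #4 descriptions
target <- c(1,1,1,1,1,0,0, 1,1,0,0,0,0,0, 1,1,1,1,1,1,NA, 1,1,1,1,1,0,0)

# Hypothetical helper: return the ids where a candidate differs from the target.
# identical() treats NA as equal to NA, which == does not.
ids_not_matching <- function(candidate) {
  ok <- mapply(identical, split(candidate, id), split(target, id))
  names(ok)[!ok]
}

ids_not_matching(ss)      # raw data: ids "1", "2" and "4" still differ
ids_not_matching(target)  # nothing left to fix: character(0)
```

Any proposed fill can then be passed to `ids_not_matching()` to see at a glance which individuals it handles as intended.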
I don't think you have considered all the edge cases. What should happen with two NAs in a row at the end of a period, or with 4 or 5 NAs in a row? That said, the na.locf function from the zoo package (last observation carried forward) reproduces the requested output for IDs 1 to 3 in your tiny test case; for ID 4 it gives 1,1,0,0,0,0,0, as the output below shows:
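library(zoo)
# Carry the last observation forward, unless the series ends in NA.
fillNA <- function(vec) {
  if (is.na(tail(vec, 1))) return(vec)
  na.locf(vec, na.rm = FALSE)  # na.rm = FALSE keeps a leading NA, so the length is unchanged
}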
> mydat$locf <- with(mydat, ave(ss, id, FUN=fillNA))
> mydat
   id time ss locf
1   1    0  1    1
2   1    1  1    1
3   1    2  1    1
4   1    3  1    1
5   1    4 NA    1
6   1    5  0    0
7   1    6  0    0
8   2    0  1    1
9   2    1  1    1
10  2    2  0    0
11  2    3 NA    0
12  2    4  0    0
13  2    5  0    0
14  2    6  0    0
15  3    0  1    1
16  3    1  1    1
17  3    2  1    1
18  3    3  1    1
19  3    4  1    1
20  3    5  1    1
21  3    6 NA   NA
22  4    0  1    1
23  4    1  1    1
24  4    2  0    0
25  4    3 NA    0
26  4    4 NA    0
27  4    5  0    0
28  4    6  0    0
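Before deciding how to treat the edge cases, it may help to see where the runs of consecutive NAs actually are. Below is a minimal sketch in base R (no zoo needed) that lists, per id, where each NA run starts, how long it is, and whether it sits at the end of the series. The helper name `na_runs` is hypothetical, not part of any package.

```r
# Rebuild the example data from the question
id   <- rep(1:4, each = 7)
time <- rep(0:6, times = 4)
ss   <- c(1,1,1,1,NA,0,0, 1,1,0,NA,0,0,0, 1,1,1,1,1,1,NA, 1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

# Hypothetical helper: describe each run of consecutive NAs in one series
na_runs <- function(vec) {
  r     <- rle(is.na(vec))          # run-length encoding of the NA pattern
  end   <- cumsum(r$lengths)        # last position of every run
  start <- end - r$lengths + 1      # first position of every run
  data.frame(start    = start[r$values],
             length   = r$lengths[r$values],
             trailing = end[r$values] == length(vec))
}

# One small data frame of NA runs per id
runs <- lapply(split(mydat$ss, mydat$id), na_runs)
runs
```

In this example, id 4 has one run of length 2 starting at position 4, and id 3 has a single trailing NA. With real data, a table like this shows how often long or trailing runs occur, which is worth knowing before settling on a filling rule.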