I have seen similar questions, but they do not answer mine. I need to create a time-varying variable for survival analysis, so I want to expand my dataset using the survSplit command (survival package), but my data is already partially in long format. Example data:
data1<-structure(list(id = c(1, 1, 1, 1, 5, 5, 5, 5, 5, 7, 7, 7, 7,
7, 7), start = c(0, 183, 210, 241, 0, 183, 187, 212, 244, 0,
118, 139, 188, 212, 237), no_days = c(NA, 28L, 28L, 28L, NA,
7L, 28L, 28L, 28L, NA, 28L, 28L, 28L, 28L, 28L), stop = c(NA,
211, 238, 269, NA, 190, 215, 240, 272, NA, 146, 167, 216, 240,
265), drug = c(0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1),
dead = c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1)), .Names = c("id",
"start", "no_days", "stop", "drug", "dead"), row.names = c(NA,
15L), class = "data.frame")
> head(data1,15)
   id start no_days stop drug dead
1   1     0      NA   NA    0    0
2   1   183      28  211    1    0
3   1   210      28  238    1    0
4   1   241      28  269    1    1
5   5     0      NA   NA    0    0
6   5   183       7  190    1    0
7   5   187      28  215    1    0
8   5   212      28  240    1    0
9   5   244      28  272    1    1
10  7     0      NA   NA    0    0
11  7   118      28  146    1    0
12  7   139      28  167    1    0
13  7   188      28  216    1    0
14  7   212      28  240    1    0
15  7   237      28  265    1    1
start is the day the drug was prescribed, no_days is how long the prescription was for, drug indicates whether a person was on the drug for the given time period (this is the variable I need to make time-varying), and dead indicates when a person died. At the moment the dataset only contains the periods during which an individual was on the drug, so the final dataset I want should look like this:
   id start no_days stop drug dead
1   1     0      NA  182    0    0
2   1   183      28  211    1    0
3   1   210      28  238    1    0
4   1   239      NA  240    0    0
5   1   241      28  269    1    1
6   5     0      NA  182    0    0
7   5   183       7  190    1    0
8   5   187      28  215    1    0
9   5   212      28  240    1    0
10  5   241      NA  243    0    0
11  5   244      28  272    1    1
12  7     0      NA  117    0    0
13  7   118      28  146    1    0
14  7   139      28  167    1    0
15  7   168      NA  187    0    0
16  7   188      28  216    1    0
17  7   212      28  240    1    0
18  7   237      28  265    1    1
Maybe this is a standard data manipulation problem where I need to add rows based on certain criteria, but since it is survival data and survSplit was designed for this (albeit starting from a slightly different data structure), I was wondering whether there is an easy way to use survSplit to solve my problem. If not, does anyone have a simple suggestion for expanding the data frame?
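For reference, here is a minimal sketch of the structure survSplit starts from, as far as I understand it: one row per subject with a single follow-up time, which it then splits at fixed cut points. The one_row data frame and the cut points 100 and 200 are made up for illustration and are not part of my real data.

```r
library(survival)

# Hypothetical one-row-per-subject data (illustration only)
one_row <- data.frame(
  id   = c(1, 5, 7),
  time = c(269, 272, 265),  # end of follow-up
  dead = c(1, 1, 1)         # event indicator
)

# Split each subject's follow-up at days 100 and 200 (arbitrary cut points).
# survSplit adds a "tstart" column and, here, an "episode" counter.
split_df <- survSplit(Surv(time, dead) ~ ., data = one_row,
                      cut = c(100, 200), episode = "episode")

print(split_df)
```

The cut points apply to every subject alike, whereas my intervals are subject-specific prescription periods, which is why I am not sure survSplit fits directly.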
My ultimate step is to fit a Cox model, something like:
coxph(Surv(start, stop, dead) ~ covariates + drug + cluster(id), data = data1)
Thanks for any suggestions.
Consider the following data wrangling in base R: merge the data frame with itself shifted by one row, so that each record is aligned with the next record for the same id, then use transform to calculate start and stop for the gap rows.
Note: merge will raise a warning (not an error) about the duplicate nextidcnt column. Either ignore it, or create a second copy of data1 for the merge and use id and idcnt (shifted by one in the new data frame) as join keys.
# Running count of records within each id
data1$idcnt <- sapply(1:nrow(data1), function(i) sum(data1[1:i, c("id")] == data1$id[i]))
data1$nextidcnt <- data1$idcnt + 1
# Self-merge: pair each record (.x) with the next record of the same id (.y)
dfm <- merge(data1, data1, by.x=c("id", "nextidcnt"), by.y=c("id", "idcnt"))
# Build the candidate gap row between each pair of records
dfm <- transform(dfm,
                 start = ifelse(is.na(stop.x), start.x, stop.x + 1),
                 no_days = no_days.x,   # carried over from the preceding record
                 stop = start.y - 1,
                 drug = 0,
                 dead = dead.x)
# Stack the on-drug rows with the real gaps (start < stop)
finaldf <- rbind(data1[data1$start != 0, c(1:6)],
                 dfm[dfm$start < dfm$stop,
                     c("id", "start", "no_days", "stop", "drug", "dead")])
finaldf <- finaldf[with(finaldf, order(id, start, stop)),]   # ORDER BY ID, START, STOP
rownames(finaldf) <- NULL                                    # RESET ROW NAMES
#    id start no_days stop drug dead
# 1   1     0      NA  182    0    0
# 2   1   183      28  211    1    0
# 3   1   210      28  238    1    0
# 4   1   239      28  240    0    0
# 5   1   241      28  269    1    1
# 6   5     0      NA  182    0    0
# 7   5   183       7  190    1    0
# 8   5   187      28  215    1    0
# 9   5   212      28  240    1    0
# 10  5   241      28  243    0    0
# 11  5   244      28  272    1    1
# 12  7     0      NA  117    0    0
# 13  7   118      28  146    1    0
# 14  7   139      28  167    1    0
# 15  7   168      28  187    0    0
# 16  7   188      28  216    1    0
# 17  7   212      28  240    1    0
# 18  7   237      28  265    1    1
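If the merge warning is a nuisance, the same "align each record with the next one" idea can be sketched without a self-merge. This is only an alternative illustration, not a replacement for the code above: it assumes data1 is already sorted by id and start, uses ave to look up the next start within each id, and sets no_days to NA on the gap rows as in the desired output in the question. The names next_start, gap_start, gap_stop, gaps and filled are introduced here for the example.

```r
# Example data from the question
data1 <- data.frame(
  id      = c(1, 1, 1, 1, 5, 5, 5, 5, 5, 7, 7, 7, 7, 7, 7),
  start   = c(0, 183, 210, 241, 0, 183, 187, 212, 244, 0, 118, 139, 188, 212, 237),
  no_days = c(NA, 28, 28, 28, NA, 7, 28, 28, 28, NA, 28, 28, 28, 28, 28),
  stop    = c(NA, 211, 238, 269, NA, 190, 215, 240, 272, NA, 146, 167, 216, 240, 265),
  drug    = c(0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1),
  dead    = c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1)
)

# Start of the next record within each id (NA for the last record of an id).
# Assumes rows are sorted by id and start.
next_start <- ave(data1$start, data1$id, FUN = function(s) c(s[-1], NA))

# A candidate gap begins the day after the current stop
# (or at the row's own start for the baseline row, where stop is NA)
gap_start <- ifelse(is.na(data1$stop), data1$start, data1$stop + 1)
gap_stop  <- next_start - 1

# Keep only real gaps; overlapping prescriptions give gap_start >= gap_stop
is_gap <- !is.na(next_start) & gap_start < gap_stop

gaps <- data.frame(
  id      = data1$id[is_gap],
  start   = gap_start[is_gap],
  no_days = NA,
  stop    = gap_stop[is_gap],
  drug    = 0,
  dead    = 0
)

# Stack the on-drug rows with the off-drug gaps and sort
filled <- rbind(data1[data1$start != 0, ], gaps)
filled <- filled[order(filled$id, filled$start, filled$stop), ]
rownames(filled) <- NULL
print(filled)
```

Like the merge version, the strict start < stop test drops one-day gaps where start would equal stop; change the comparison if those rows are needed.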