Basically, when using tmerge() to create data for time-varying-covariate Cox regression, two ways of expressing times should give the same regression results (I think), but they don't.
One way uses start and end dates, and converts them to numeric within Surv(); the other just uses numeric days to event.
Example
First, create some data. We have an ID, an outcome (death), a start date for each row, and an end date some time later. The start date and end date are Date objects.
library(survival)  # provides tmerge(), Surv() and coxph()

n <- 1000
set.seed(0)
dd <- data.frame(id=1:n,
                 death=sample(x=c(FALSE, TRUE), prob=c(8, 1), size=n, replace=TRUE),
                 startDate=as.Date(runif(min=as.numeric(as.Date("2000-01-01")),
                                         max=as.numeric(as.Date("2019-12-31")), n=n), origin="1970-01-01"))
dd$endDate <- as.Date(as.numeric(dd$startDate) +
                        rnorm(mean=3650, sd=500, n=n), origin="1970-01-01")
# Check that endDate is never before startDate.
stopifnot(all(dd$endDate >= dd$startDate))
Rather than a start and end date for each participant, we could alternatively start each person's time at zero and have a numeric number of days until event/censor:
dd$startDay <- 0
dd$endDay <- as.numeric(dd$endDate - dd$startDate)
Next, we use tmerge() to transform the data into the format that would be needed for Cox regression with time-varying covariates. (Note: this is a minimal example that does not actually have any time-varying covariates.)
We do this two ways, to compare: 1) using numeric days to event/censor; 2) using dates.
Using days
ddTv <- tmerge(data1=dd, data2=dd, id=id,
               tstart=startDay, tstop=endDay, event=event(endDay, death))
ddTv[1:6, ]
id death startDate endDate startDay endDay tstart tstop event
1 TRUE 2005-04-23 2014-11-29 0 3506.6 0 3506.6 TRUE
2 FALSE 2010-08-13 2023-02-16 0 4570.6 0 4570.6 FALSE
3 FALSE 2013-09-11 2023-06-22 0 3571.6 0 3571.6 FALSE
4 FALSE 2007-08-31 2015-10-03 0 2955.1 0 2955.1 FALSE
5 TRUE 2019-02-05 2027-01-27 0 2913.4 0 2913.4 TRUE
6 FALSE 2002-05-14 2012-04-06 0 3615.2 0 3615.2 FALSE
Using dates
ddTvDate <- tmerge(data1=dd, data2=dd, id=id,
                   tstart=startDate, tstop=endDate, event=event(endDate, death))
ddTvDate[1:6, ]
id death startDate endDate startDay endDay tstart tstop event
1 TRUE 2005-04-23 2014-11-29 0 3506.6 2005-04-23 2014-11-29 TRUE
2 FALSE 2010-08-13 2023-02-16 0 4570.6 2010-08-13 2023-02-16 FALSE
3 FALSE 2013-09-11 2023-06-22 0 3571.6 2013-09-11 2023-06-22 FALSE
4 FALSE 2007-08-31 2015-10-03 0 2955.1 2007-08-31 2015-10-03 FALSE
5 TRUE 2019-02-05 2027-01-27 0 2913.4 2019-02-05 2027-01-27 TRUE
6 FALSE 2002-05-14 2012-04-06 0 3615.2 2002-05-14 2012-04-06 FALSE
However, these two ways of expressing the same data don't give the same regression results. We'll compare just the null model:
Using days
ddMod <- coxph(formula=
                 Surv(time=tstart, time2=tstop, event=death) ~ 1,
               data=ddTv)
ddMod
Null model
log likelihood= -702.08
n= 1000
Using dates
ddModDate <- coxph(formula=
                     Surv(time=as.numeric(tstart), time2=as.numeric(tstop), event=death) ~ 1,
                   data=ddTvDate)
ddModDate
Null model
log likelihood= -681.85
n= 1000
Log-likelihoods are similar, but not the same.
Why are these not the same?
If you add covariates to the model then coefficients and p values between the two versions are again not the same.
Finally, if you don't use tmerge(), and go straight to coxph() on the original dataset, then both methods give you the same results. Both of these models
ddMod2 <- coxph(formula=
Surv(time=endDay, event=death) ~ 1,
data=dd)
ddMod2
ddModDate2 <- coxph(formula=
Surv(time=as.numeric(endDate - startDate), event=death) ~ 1,
data=dd)
ddModDate2
give the same results as ddMod above, the version using days.
Professor Terry M. Therneau (creator of the survival package) kindly gave me an answer, with permission to post here.
Paraphrasing:
Basically, the results are different because those are two entirely different models. Consider, for example, a participant who had an event on the 100th day that they were in the study, on January 1, 2010.
If I use calendar dates for my times, then the risk set for that event is everyone who was in the study on January 1, 2010.
If I use time since entry for my times, then the risk set for that event is everyone who was still in the study on their 100th day since entry.
Those are probably very different sets of people!
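To make the two risk sets concrete, here is a small base-R sketch. The four participants and their dates are made up purely for illustration (they are not taken from the data above), and the at-risk rule assumed here is the usual counting-process one: a person is at risk at time t if start < t <= stop.

```r
# Hypothetical toy data: participant 1 has an event on 2010-01-01,
# which is 100 days after their entry on 2009-09-23.
toy <- data.frame(
  id        = 1:4,
  startDate = as.Date(c("2009-09-23", "2005-01-01", "2009-12-25", "2010-06-01")),
  endDate   = as.Date(c("2010-01-01", "2009-06-30", "2011-12-31", "2012-06-01"))
)

# Calendar time scale: who was in the study on the event date?
eventDate <- toy$endDate[1]
calendarRiskSet <- toy$id[toy$startDate < eventDate & toy$endDate >= eventDate]

# Time-since-entry scale: who was still in the study on their own day 100?
followUp <- as.numeric(toy$endDate - toy$startDate)  # days since entry
eventDay <- followUp[1]
entryRiskSet <- toy$id[followUp >= eventDay]

calendarRiskSet  # participants 1 and 3
entryRiskSet     # all four participants
```

Participant 2 left the study before 2010 and participant 4 had not yet entered, so neither is in the calendar-time risk set, but both were followed for more than 100 days and so both count on the time-since-entry scale.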
For almost every study, time since entry is the measure you want.
Obvious once he points it out, opaque to me until then.
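For anyone who starts from dates, a minimal sketch of putting the intervals on the time-since-entry scale before modelling is below. It assumes one entry date per participant, and the data frame toy and the object names fitCalendar, fitEntry and fitSimple are hypothetical, made up for this illustration. It skips tmerge() entirely, since the time scale, not tmerge(), is what changes the model.

```r
library(survival)

# Hypothetical toy data with calendar entry and exit dates
set.seed(1)
n <- 200
toy <- data.frame(
  id        = 1:n,
  death     = sample(c(FALSE, TRUE), size = n, replace = TRUE, prob = c(3, 1)),
  startDate = as.Date("2000-01-01") + sample(0:3650, n, replace = TRUE)
)
toy$endDate <- toy$startDate + sample(100:2000, n, replace = TRUE)

# Calendar time scale: risk sets are defined by who is in the study on each date
fitCalendar <- coxph(Surv(as.numeric(startDate), as.numeric(endDate), death) ~ 1,
                     data = toy)

# Time since entry: shift each interval by that participant's own start date
toy$tstart <- as.numeric(toy$startDate - toy$startDate)  # zero for everyone
toy$tstop  <- as.numeric(toy$endDate - toy$startDate)    # days of follow-up
fitEntry  <- coxph(Surv(tstart, tstop, death) ~ 1, data = toy)
fitSimple <- coxph(Surv(tstop, death) ~ 1, data = toy)

# The two time-since-entry fits should agree; the calendar-time fit
# is a different model and will generally not match them.
c(calendar = fitCalendar$loglik, entry = fitEntry$loglik, simple = fitSimple$loglik)
```

With genuinely time-varying covariates there would be several rows per participant, and the same shift (subtracting that participant's entry date from both the start and the stop of every row) would apply.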