Using the survival package in R, we can use the "heart" dataset:
survfit(Surv(stop, event) ~ transplant, data = heart)
This outputs a model that has n=172 (103 in the transplant=0 group and 69 in the transplant=1 group) and 75 events (30 in transplant=0; 45 in transplant=1).
And if we plot the K-M curve with survminer
ggsurvplot(survfit(Surv(stop, event) ~ transplant, data = heart), risk.table = "nrisk_cumcensor", xlim=c(0,5*365), = 365,
It shows that there are 103 and 69 individuals at risk to start with in each transplant group.
However, there are only 103 individuals in total (length(unique(heart$id))), not 172.
Trying to force the id with either "id" or "cluster" (e.g. survfit(Surv(stop, event) ~ transplant, id=id, cluster=id, data = heart)) doesn't change the result.
How can we make the model understand there are multiple lines for each individual?
For this I would recommend looking into time-dependent Cox regression; there is a good vignette on it in the survival package.
There are several ways you can account for multiple observations per patient. The simplest, time-dependent Cox regression, assumes that the covariates stay constant until the next observation. For each observation you define a start time, a stop time (the time of the next observation), and a status indicator that records whether an event occurred during that interval. The data would look similar to:
  id time1 time2 status
1  1     0    30      0
2  1    30   100      1
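The heart dataset from the question already appears to be stored in this interval layout, with columns start, stop and event. A minimal sketch for checking that, assuming only the survival package is installed (the variable names n_rows, n_patients, rows_per_patient and multi_id are made up for illustration):

```r
library(survival)

# heart ships with survival; each row is one interval for one patient
n_rows     <- nrow(heart)               # number of intervals (rows)
n_patients <- length(unique(heart$id))  # number of distinct patients

# How many rows does each patient contribute?
rows_per_patient <- table(heart$id)
table(rows_per_patient)

# Show the intervals of the first patient who has more than one row
multi_id <- names(rows_per_patient)[rows_per_patient > 1][1]
heart[heart$id == multi_id, c("id", "start", "stop", "event", "transplant")]
```

This makes the mismatch in the question visible: the row count is the number of intervals, not the number of patients, so a formula that only uses stop treats every interval as if it were a separate individual followed from time 0.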
And the Cox regression could then take the form
coxph(Surv(time1, time2, status) ~ ., cluster = id, data=df)
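Applied to the heart dataset from the question, the same idea might look like the sketch below. The choice of covariates (transplant, age, surgery) is an assumption made for illustration, not a recommended model, and the cluster argument requires a reasonably recent version of survival.

```r
library(survival)

# Counting-process form: each row is at risk only on (start, stop],
# so a patient's two rows are not counted as two individuals at time 0.
# cluster = id requests a robust variance grouped by patient.
fit <- coxph(Surv(start, stop, event) ~ transplant + age + surgery,
             cluster = id, data = heart)

summary(fit)

# Number of intervals and number of events used in the fit
fit$n
fit$nevent
```

Here fit$n still counts rows (intervals) rather than patients, but because each row carries its own entry time, the risk sets at any given time contain each patient at most once.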
There are other, more sophisticated methods for these analyses, such as multivariate models (so-called joint models), for which there are other packages such as JM,