I'm looking for ways to uses tree-like algoritms to perform a survival analysis on left-truncated, right censored data. I tried the packages ipred
and pec
, but the functions ipredbagg
and pecCforest
seem only to work without left truncation.
Data description
My data looks a lot like the heart dataset from the Stanford Heart Transplant data. The objects are in reality at risk from t=0, but some objects (for me the huge majority) enter the survey only at a later time t1, so when they would have died at t < t1 they would not have entered the dataset. It has been shown that when this left truncation is neglected, huge errors can arise, see e.g. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3121224/ Besides the left truncation my dataset is also right sensored, just as the heart dataset, so some objects leave the dataset without having died/shown the relevant event.
The heart dataset looks like this
Surv(heart$start, heart$stop, heart$event)
[1] ( 0.0, 50.0] ( 0.0, 6.0] ( 0.0, 1.0+] ( 1.0, 16.0] ( 0.0, 36.0+] ( 36.0, 39.0]
[7] ( 0.0, 18.0] ( 0.0, 3.0] ( 0.0, 51.0+] ( 51.0, 675.0] ( 0.0, 40.0] ( 0.0, 85.0]
[13] ( 0.0, 12.0+] ( 12.0, 58.0] ( 0.0, 26.0+] ( 26.0, 153.0] ( 0.0, 8.0] ( 0.0, 17.0+]
[19] ( 17.0, 81.0] ( 0.0, 37.0+] ( 37.0,1387.0] ( 0.0, 1.0] ( 0.0, 28.0+] ( 28.0, 308.0]
[25] ( 0.0, 36.0] ( 0.0, 20.0+] ( 20.0, 43.0] ( 0.0, 37.0] ( 0.0, 18.0+] ( 18.0, 28.0]
[31] ( 0.0, 8.0+] ( 8.0,1032.0] ( 0.0, 12.0+] ( 12.0, 51.0] ( 0.0, 3.0+] ( 3.0, 733.0]
[37] ( 0.0, 83.0+] ( 83.0, 219.0] ( 0.0, 25.0+] ( 25.0,1800.0+] ( 0.0,1401.0+] ( 0.0, 263.0]
[43] ( 0.0, 71.0+] ( 71.0, 72.0] ( 0.0, 35.0] ( 0.0, 16.0+] ( 16.0, 852.0] ( 0.0, 16.0]
[49] ( 0.0, 17.0+] ( 17.0, 77.0] ( 0.0, 51.0+] ( 51.0,1587.0+] ( 0.0, 23.0+] ( 23.0,1572.0+]
[55] ( 0.0, 12.0] ( 0.0, 46.0+] ( 46.0, 100.0] ( 0.0, 19.0+] ( 19.0, 66.0] ( 0.0, 4.5+]
[61] ( 4.5, 5.0] ( 0.0, 2.0+] ( 2.0, 53.0] ( 0.0, 41.0+] ( 41.0,1408.0+] ( 0.0, 58.0+]
[67] ( 58.0,1322.0+] ( 0.0, 3.0] ( 0.0, 2.0] ( 0.0, 40.0] ( 0.0, 1.0+] ( 1.0, 45.0]
[73] ( 0.0, 2.0+] ( 2.0, 996.0] ( 0.0, 21.0+] ( 21.0, 72.0] ( 0.0, 9.0] ( 0.0, 36.0+]
[79] ( 36.0,1142.0+] ( 0.0, 83.0+] ( 83.0, 980.0] ( 0.0, 32.0+] ( 32.0, 285.0] ( 0.0, 102.0]
[85] ( 0.0, 41.0+] ( 41.0, 188.0] ( 0.0, 3.0] ( 0.0, 10.0+] ( 10.0, 61.0] ( 0.0, 67.0+]
[91] ( 67.0, 942.0+] ( 0.0, 149.0] ( 0.0, 21.0+] ( 21.0, 343.0] ( 0.0, 78.0+] ( 78.0, 916.0+]
[97] ( 0.0, 3.0+] ( 3.0, 68.0] ( 0.0, 2.0] ( 0.0, 69.0] ( 0.0, 27.0+] ( 27.0, 842.0+]
[103] ( 0.0, 33.0+] ( 33.0, 584.0] ( 0.0, 12.0+] ( 12.0, 78.0] ( 0.0, 32.0] ( 0.0, 57.0+]
[109] ( 57.0, 285.0] ( 0.0, 3.0+] ( 3.0, 68.0] ( 0.0, 10.0+] ( 10.0, 670.0+] ( 0.0, 5.0+]
[115] ( 5.0, 30.0] ( 0.0, 31.0+] ( 31.0, 620.0+] ( 0.0, 4.0+] ( 4.0, 596.0+] ( 0.0, 27.0+]
[121] ( 27.0, 90.0] ( 0.0, 5.0+] ( 5.0, 17.0] ( 0.0, 2.0] ( 0.0, 46.0+] ( 46.0, 545.0+]
[127] ( 0.0, 21.0] ( 0.0, 210.0+] (210.0, 515.0+] ( 0.0, 67.0+] ( 67.0, 96.0] ( 0.0, 26.0+]
[133] ( 26.0, 482.0+] ( 0.0, 6.0+] ( 6.0, 445.0+] ( 0.0, 428.0+] ( 0.0, 32.0+] ( 32.0, 80.0]
[139] ( 0.0, 37.0+] ( 37.0, 334.0] ( 0.0, 5.0] ( 0.0, 8.0+] ( 8.0, 397.0+] ( 0.0, 60.0+]
[145] ( 60.0, 110.0] ( 0.0, 31.0+] ( 31.0, 370.0+] ( 0.0, 139.0+] (139.0, 207.0] ( 0.0, 160.0+]
[151] (160.0, 186.0] ( 0.0, 340.0] ( 0.0, 310.0+] (310.0, 340.0+] ( 0.0, 28.0+] ( 28.0, 265.0+]
[157] ( 0.0, 4.0+] ( 4.0, 165.0] ( 0.0, 2.0+] ( 2.0, 16.0] ( 0.0, 13.0+] ( 13.0, 180.0+]
[163] ( 0.0, 21.0+] ( 21.0, 131.0+] ( 0.0, 96.0+] ( 96.0, 109.0+] ( 0.0, 21.0] ( 0.0, 38.0+]
[169] ( 38.0, 39.0+] ( 0.0, 31.0+] ( 0.0, 11.0+] ( 0.0, 6.0]
So at the first time in each interval the object enters my set and starts to be "at risk". At the second time the object leaves the set, some because the interesting event happens (without a '+') and others are censored (with '+').
Cox regression
For the Cox regression everything works fine. This Surv-object created above can be used to perform a Cox regression.
coxtime=coxph(Surv(heart$start, heart$stop, heart$event)~1,data=heart)
summary(coxtime)
Call: coxph(formula = Surv(heart$start, heart$stop, heart$event) ~
1, data = heart)
Null model
log likelihood= -298.1214
n= 172
I can also plot the survival function
plot(survfit(coxtime),xscale=365.25, xlab = "Years", ylab="Survival")
Survival function of heart dataset Now I want to perform the same analysis with a tree-like algorithm.
ipredbagg
When I try the ippredbag
-function, this function works fine without the left truncation:
library(survival)
library(ipred)
#without left truncation
ipredbagg(Surv(heart$stop, heart$event) ,X=heart$surgery)
I get the result
Bagging survival trees with 25 bootstrap replications
Because there are rows in the heart set where the start value is 0, I get an error when I simply enter the start and stop into the ipredbagg
function.
#with left truncation
ipredbagg(Surv(heart$start, heart$stop, heart$event) ,X=heart$surgery)
Error in get(paste("rpart", method, sep = "."), envir = environment())(Y, :
Observation time must be > 0
Therefore I add one to both the start and stop columns, but now I get another error.
#with left truncation and start > 0
ipredbagg(Surv(heart$start+1, heart$stop+1, heart$event) ,X=heart$surgery)
The error:
Error in table(index2, levels = 1:ngrp) :
all arguments must have the same length
pecCforest
My second try is the pecCforest
-function from the pec
-package. This function also works on the data without left truncation.
library(pec)
library(party)
library(survival)
#without left truncation
fitcforest <- pecCforest(Surv(stop, event) ~ .,data=heart[,-which(names(heart)=="start")],
controls = cforest_classical(ntree=100),mtry=2);
predictSurvProb(fitcforest,heart[1,],times=1)
I get the result
[,1]
[1,] 0.9890595
This time I can train the model on the start + stop columns without an error, but I'm unable to get a prediction out of it.
#with left truncation
fitcforest <- pecCforest(Surv((start), (stop), event) ~ .,data=heart,
controls = cforest_classical(ntree=100));
predictSurvProb(fitcforest,heart[1,],times=1)
This results in
Error in predict.survfit(object, newdata = newdata, times = times, bytimes = TRUE, :
Predictions only available
for class 'survfit', possibly stratified Kaplan-Meier fits.
For class 'cph' Cox models see survest.cph.
Adding 1 to both columns start and stop results in the same error.
#with left truncation and start > 0
fitcforest <- pecCforest(Surv((start+1), (stop+1), event) ~ .,data=heart,
controls = cforest_classical(ntree=100,mtry=2));
predictSurvProb(fitcforest,heart[1,],times=1)
Error in predict.survfit(object, newdata = newdata, times = times, bytimes = TRUE, :
Predictions only available
for class 'survfit', possibly stratified Kaplan-Meier fits.
For class 'cph' Cox models see survest.cph.
Is there a way to get these functions work on left-truncated data? It seems that left truncation is not implemented in both functions, but I cannot find information about that. Is there an alternative way to do survivalanalysis on left truncated data in R with an algoritm based on trees (I managed to do a standard Cox regression)?
You can try recent package LTRCtrees on CRAN, which is designed for left-truncated survival data