Search code examples
rcox-regressionsurvival

Get predictions from coxph


# Create the simplest test data set 
test1 <- list(time=c(4,3,1,1,2,2,3), 
              status=c(1,1,1,0,1,1,0), 
              x=c(0,2,1,1,1,0,0), 
              sex=c(0,0,0,0,1,1,1)) 
# Fit a stratified model 
m=coxph(Surv(time, status) ~ x + sex, test1) 

y=predict(m,type="survival",by="sex")

Basically what I am doing is making fake data called test1, then I am fitting a simple coxph model and saving it as 'm'. Then what I aim to do is get the predicted probabilities and confidence bands for the survival probability separate for sexes. My hopeful dataset 'y' will include: age, survival probability, lower confidence band, upper confidence band, and sex which equals to '0' or '1'.


Solution

  • This can be accomplished in two ways. The first is a slight modification to your code, using the predict() function to get predictions at a specific times for specific combinations of covariates. The second is by using the survfit() function, which estimates the entire survival curve and is easy to plot. The confidence intervals don't exactly agree as we'll see, but they should match fairly closely as long as the probabilities aren't too close to 1 or 0.

    Below is code to both make the predictions as your code tries. It uses the built-in cancer data. The important difference is to create a newdata which has the covariate values you're interested in. Because of the non-linear nature of survival probabilities it is generally a bad idea to try and make a prediction for the "average person". Because we want to get a survival probability we must also specify what time to consider that probability. I've taken time = 365, age = 60, and both sex = 1 and sex = 2 So this code predicts the 1-year survival probability for a 60 year old male and a 60 year old female. Note that we must also include status in the newdata, even though it doesn't affect the result.

    library(survival)
    mod <- coxph(Surv(time,status) ~ age + sex, data = cancer)
    pred_dat <- data.frame(time = c(365,365), status = c(2,2),
                           age = c(60,60), sex = c(1,2))
    preds <- predict(mod, newdata = pred_dat,
                     type = "survival", se.fit = TRUE)
    pred_dat$prob <- preds$fit
    pred_dat$lcl <- preds$fit - 1.96*preds$se.fit
    pred_dat$ucl <- preds$fit + 1.96*preds$se.fit
    pred_dat
    #>   time status age sex      prob       lcl       ucl
    #> 1  365      2  60   1 0.3552262 0.2703211 0.4401313
    #> 2  365      2  60   2 0.5382048 0.4389833 0.6374264
    

    We see that for a 60 year old male the 1 year survival probability is estimated as 35.5%, while for a 60 year old female it is 53.8%.

    Below we estimate the entire survival curve using survfit(). I've saved time by reusing the pred_dat from above, and because the plot gets messy I've only plotted the male curve, which is the first row. I've also added some flair, but you only need the first 2 lines.

    fit <- survfit(mod, newdata = pred_dat[1,])
    plot(fit, conf.int = TRUE)
    title("Estimated survival probability for age 60 male")
    abline(v = 365, col = "blue")
    abline(h = pred_dat[1,]$prob, col = "red")
    abline(h = pred_dat[1,]$lcl, col = "orange")
    abline(h = pred_dat[1,]$ucl, col = "orange")
    

    Created on 2022-06-09 by the reprex package (v2.0.1)

    I've overlaid lines corresponding to the predicted probabilities from part 1. The red line is the estimated survival probability at day 365 and the orange lines are the 95% confidence interval. The predicted survival probability matches, but if you squint closely you'll see the confidence interval doesn't match exactly. That's generally not a problem, but if it is a problem you should trust the ones from survfit() instead of the ones calculated from predict().

    You can also dig into the values of fit to extract fitted probabilities and confidence bands, but the programming is a little more complicated because the desired time doesn't usually match exactly.