Search code examples
rsurvival-analysissurvminer

Why does the survival probability of the survival package return 0% at the end of the time horizon when there are survivors in the dataset?


I've just started using the survival and survminer packages in R and am trying to understand its output. In the code below I create a dataframe with the first 12 rows of my actual dataset, as representative of the issue/question. In this representative data:

  • ID = unique identifier for each element
  • time = survival time for the element in months where value > 0 means death (the month that death occurs) and value = 0 means no death (right censored) during the study period
  • status = the element's censoring status where 1=censored and 2=dead
  • node = one of the variables associated with each element where I try to assess its association with the probability of death

Running length(which(testDF$status == 2))/nrow(testDF) shows a death rate of 66.67% with this data, but the survival probability curves shown in the image below end at 0%. Should they not be ending at 66.67% at least for the average of all the data? What am I doing wrong here or am I misinterpreting survival probability?

enter image description here

Code:

library(ggplot2)
library(survival)
library(survminer)

testDF <- data.frame(
  ID = 1:12,
  time = c(0,34,0,12,12,21,0,0,39,11,13,26),
  status = c(1,2,1,2,2,2,1,1,2,2,2,2),
  node = c("C","C","B","A","C","C","B","C","B","C","A","B")
)

fit <- survfit(Surv(time, status) ~ node, data = testDF)

ggsurvplot(fit,
           pval = TRUE, 
           conf.int = TRUE,
           linetype = "strata",
           surv.median.line = "hv",
           ggtheme = theme_bw()
           )

# percentage of deaths
length(which(testDF$status == 2))/nrow(testDF)

Solution

  • My coding of censored (no death, the survivors) observations in the "time" column with 0's was incorrect as Edward points out in his comments. Now I recode those survivor observations with the time length of the study of 40. I also re-run the plot without confidence intervals for solution clarity.

    testDF <- data.frame(
      ID = 1:12,
      time = c(40,34,40,12,12,21,40,40,39,11,13,26), # 40 month study window (0's for no death changed to 40)
      status = c(0,1,0,1,1,1,0,0,1,1,1,1), # 0 = censored, 1 = death
      node = c("C","C","B","A","C","C","B","C","B","C","A","B")
    )
    
    # survival rates: total = 33.3%, node A = 0%, node B = 50%, node C = 33.3%
    length(which(testDF$status == 0))/nrow(testDF)
    length(which(testDF$status == 0 & testDF$node == "A"))/length(which(testDF$node == "A"))
    length(which(testDF$status == 0 & testDF$node == "B"))/length(which(testDF$node == "B"))
    length(which(testDF$status == 0 & testDF$node == "C"))/length(which(testDF$node == "C"))
    

    Plot running the above revised DF:

    enter image description here