Search code examples
rdataframesurvival-analysis

R draw survival curve and calculate P-value at specific times


I am trying to figure out how to generate a survival curve and calculate P-value of a specific time point and not of the entire survival curve.

I use the surv and survfit methods from packages survminer, survival to create the survival object and ggsurvplot to draw the curve and it's p-value.

df_surv <- Surv(time = df$diff_in_days, event = df$survivalstat)
df_survfit <- survfit(dat_surv ~ Schedule, data = df)

ggsurvplot(
  df_survfit , 
  data = df,
  pval = TRUE
)

Now it calculates the p-value over the entire curve of 2500+ days. I would also like to calculate the P-value at exact intervals. Let's say I would like to know the survival probability at / up-to 365 days.

I can't simply cut-off all records which have survival times longer than x (e.g. 365) days, as below. Then all survival probability drops to 0% since subjects who had the event occur later than 365 aren't taken into account.

the event hasn't there also isn't anybody alive beyond x days anymore.

df <- df[df$diff_in_days <= 365, ]

How can I calculate the P-value at a specific time from the overall curve?

The dput(head(df) of my dataframe for reproducible example.

structure(list(diff_in_days = structure(c(2160, 84, 273, 1245, 
2175, 114), class = "difftime", units = "days"), Schedule = c(1, 
1, 1, 2, 2, 2), survivalstat = c(0, 1, 1, 0, 1, 1)), row.names = c(12L, 
28L, 33L, 38L, 58L, 62L), class = "data.frame")

My dataframe

  • UID (each row is a new entry)
  • Event occurrence no/yes (0,1)
  • Integer amount of days till event happened (if occurence didnt happen yet, the days from start of monitoring till current is calculated (right-censoring))

EDIT:

using following code to set everybody's event occurance to 0 after 365 days.

dat$survivalstat <- ifelse(dat$diff_in_days > 365, 0, dat$survivalstat)

It does calculate the p-value but still over the entire curve. After 365 days it stays horizontal till the end at 2500+ days (since no events occur) and those events after 365 days are all still taken into account because theyre still in the curve. (I assume that even tho all the datapoints after 365 are the same, they still affect the P-value?)


Solution

  • If you want a p-value at a specific time point you can do a z-test at a particular time point. In my example below I used the lung data set from the survival package. For better help to see if this method is appropriate I would post this question on cross validated.

    library(survival)
    library(dplyr)
    library(broom)
    library(ggplot2)
    fit1 <- survfit(Surv(time,status)~sex,data = lung)
              #turn into df
    df <- broom::tidy(fit1) 
    
    fit_df <- df  %>% 
              #group by strata
              group_by(strata) %>% 
              #get day  of interest or day before it
              filter(time <= 365) %>% 
              arrange(time) %>% 
              # pulls last date
              do(tail(.,1))
    
    #calculate z score based on 2 sample test at that time point
    z <- (fit_df$estimate[1]-fit_df$estimate[2]) /
          (sqrt( fit_df$std.error[1]^2+ fit_df$std.error[2]^2))
    #get probability of z score
    pz <- pnorm(abs(z))
    #get p value
    pvalue <- round(2 * (1-pz),2)
    
    
    
    ggplot(data = df,  aes(x=time, y=estimate, group=strata, color= strata)) +
      geom_line(size = 1.5)+
      geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2)+
      geom_vline(aes(xintercept=365))+
      geom_text(aes(x = 500,y=.8,label = paste0("p = " ,pvalue) ))+
      scale_y_continuous("Survival",
                         limits = c(0,1))+
      scale_x_continuous("Time")+
      scale_color_manual(" ", values = c("grey", "blue"))+
      scale_fill_discrete(guide = FALSE)+
      theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14),
            axis.title.x = element_text(size =14),
            axis.text.y = element_text(size = 14),
            strip.text.x = element_text(size=14),
            axis.title.y = element_blank())+
      theme_bw()
    

    enter image description here

    Update - getting p-value up to a specific time point using log rank

    #First censor and make follow time to the time point of interest 
    lung2 <- lung %>% 
              mutate(time2 = ifelse(time >= 365, 365, time),
                     status2 = ifelse(time >= 365, 1,status))
    #Compute log rank test using survdiff
    sdf <- survdiff(Surv(time2,status2)~sex,data = lung2)
    #extract p-value
    p.val <- round(1 - pchisq(sdf$chisq, length(sdf$n) - 1),3)
    

    In the ggplot code above you can replace pvalue with p.val so it shows the log rank score.