I am trying to figure out how to generate a survival curve and calculate P-value of a specific time point and not of the entire survival curve.
I use the surv
and survfit
methods from packages survminer
, survival
to create the survival object and ggsurvplot
to draw the curve and it's p-value.
df_surv <- Surv(time = df$diff_in_days, event = df$survivalstat)
df_survfit <- survfit(dat_surv ~ Schedule, data = df)
ggsurvplot(
df_survfit ,
data = df,
pval = TRUE
)
Now it calculates the p-value over the entire curve of 2500+ days. I would also like to calculate the P-value at exact intervals. Let's say I would like to know the survival probability at / up-to 365 days.
I can't simply cut-off all records which have survival times longer than x (e.g. 365) days, as below. Then all survival probability drops to 0% since subjects who had the event occur later than 365 aren't taken into account.
the event hasn't there also isn't anybody alive beyond x days anymore.
df <- df[df$diff_in_days <= 365, ]
How can I calculate the P-value at a specific time from the overall curve?
The dput(head(df)
of my dataframe for reproducible example.
structure(list(diff_in_days = structure(c(2160, 84, 273, 1245,
2175, 114), class = "difftime", units = "days"), Schedule = c(1,
1, 1, 2, 2, 2), survivalstat = c(0, 1, 1, 0, 1, 1)), row.names = c(12L,
28L, 33L, 38L, 58L, 62L), class = "data.frame")
My dataframe
EDIT:
using following code to set everybody's event occurance to 0 after 365 days.
dat$survivalstat <- ifelse(dat$diff_in_days > 365, 0, dat$survivalstat)
It does calculate the p-value but still over the entire curve. After 365 days it stays horizontal till the end at 2500+ days (since no events occur) and those events after 365 days are all still taken into account because theyre still in the curve. (I assume that even tho all the datapoints after 365 are the same, they still affect the P-value?)
If you want a p-value at a specific time point you can do a z-test at a particular time point. In my example below I used the lung data set from the survival package. For better help to see if this method is appropriate I would post this question on cross validated.
library(survival)
library(dplyr)
library(broom)
library(ggplot2)
fit1 <- survfit(Surv(time,status)~sex,data = lung)
#turn into df
df <- broom::tidy(fit1)
fit_df <- df %>%
#group by strata
group_by(strata) %>%
#get day of interest or day before it
filter(time <= 365) %>%
arrange(time) %>%
# pulls last date
do(tail(.,1))
#calculate z score based on 2 sample test at that time point
z <- (fit_df$estimate[1]-fit_df$estimate[2]) /
(sqrt( fit_df$std.error[1]^2+ fit_df$std.error[2]^2))
#get probability of z score
pz <- pnorm(abs(z))
#get p value
pvalue <- round(2 * (1-pz),2)
ggplot(data = df, aes(x=time, y=estimate, group=strata, color= strata)) +
geom_line(size = 1.5)+
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2)+
geom_vline(aes(xintercept=365))+
geom_text(aes(x = 500,y=.8,label = paste0("p = " ,pvalue) ))+
scale_y_continuous("Survival",
limits = c(0,1))+
scale_x_continuous("Time")+
scale_color_manual(" ", values = c("grey", "blue"))+
scale_fill_discrete(guide = FALSE)+
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=14),
axis.title.x = element_text(size =14),
axis.text.y = element_text(size = 14),
strip.text.x = element_text(size=14),
axis.title.y = element_blank())+
theme_bw()
#First censor and make follow time to the time point of interest
lung2 <- lung %>%
mutate(time2 = ifelse(time >= 365, 365, time),
status2 = ifelse(time >= 365, 1,status))
#Compute log rank test using survdiff
sdf <- survdiff(Surv(time2,status2)~sex,data = lung2)
#extract p-value
p.val <- round(1 - pchisq(sdf$chisq, length(sdf$n) - 1),3)
In the ggplot
code above you can replace pvalue
with p.val
so it shows the log rank score.