r statistics curve-fitting survival-analysis

fitting a distribution to survival curve

I've got the following data representing a survival function.

# A tibble: 53 x 2
   month survival
   <int>    <dbl>
 1     0    1.00 
 2     1    1.00 
 3     2    1.00 
 4     3    1.00 
 5     4    1.00 
 6     5    1.00 
 7     6    0.999
 8     7    0.998
 9     8    0.997
10     9    0.993
11    10    0.984
12    11    0.976
13    12    0.973
14    13    0.971
15    14    0.969
16    15    0.969
17    16    0.969
18    17    0.969
19    18    0.968
20    19    0.968
21    20    0.968
22    21    0.968
23    22    0.968
24    23    0.968
25    24    0.967
26    25    0.966
27    26    0.966
28    27    0.962
29    28    0.957
30    29    0.952
31    30    0.948
32    31    0.944
33    32    0.942
34    33    0.941
35    34    0.941
36    35    0.941
37    36    0.941
38    37    0.940
39    38    0.939
40    39    0.938
41    40    0.938
42    41    0.938
43    42    0.935
44    43    0.934
45    44    0.930
46    45    0.920
47    46    0.910
48    47    0.895
49    48    0.884
50    49    0.881
51    50    0.879
52    51    0.878
53    52    0.878

I would like to fit a distribution to a survival curve. To do so, first I plot the survival with respect to month. Then I use fitdist function in order to fit a few distributions.

library('fitdistrplus')
library('flexsurv') 
data <- tibble(month = 0:52, survival = c(1, 1, 1, 1, 1, 1, 0.999, 0.998, 
0.997, 0.993, 0.984, 0.976, 0.973, 0.971, 0.969, 0.969, 0.969, 0.969, 0.968, 
0.968, 0.968, 0.968, 0.968, 0.968, 
0.967, 0.966, 0.966, 0.962, 0.957, 0.952, 0.948, 0.944, 
0.942, 0.941, 0.941, 0.941, 0.941, 0.940, 0.939, 0.938, 
0.938, 0.938, 0.935, 0.934, 0.930, 0.920, 0.910, 0.895, 
0.884, 0.881, 0.879, 0.878, 0.878))

data %>% ggplot(aes(month, survival)) + geom_line() 

fit_weibull <- fitdist(data[['survival']], 'weibull')
fit_llogis <- fitdist(data[['survival']], "llogis")
fit_log <- fitdist(data[['survival']], "logis")

fit_weibull$aic
fit_llogis$aic
fit_log$aic

According to AIC I should go for the Weibull distribution with a shape = 34.6167936 and scale = 0.9695298. But I've got a problem with understanding how exactly should I use this distribution to calculate my estimated survival. I was confident that because S(t) = 1 - F(t) I should just calculate 1 -pweibull(data[['month']], fit_weibull$estimate[['shape']], fit_weibull$estimate[['scale']]), but it results in following vector:

 [1] 1.00000000 0.05399642 0.00000000 0.00000000 0.00000000 0.00000000 
 0.00000000 0.00000000
 [9] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
 0.00000000 0.00000000
 [17] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
 0.00000000 0.00000000
 [25] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
 0.00000000 0.00000000
 [33] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
 0.00000000 0.00000000
 [41] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
 0.00000000 0.00000000
 [49] 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000

So my understanding seems to be awfully wrong. How should I use fit_weibull to estimate a survival and plot the estimated curve then?

Solution

You've got a non-standard version of survival analysis to deal with here. Normally survival analysis data is recorded in terms of discrete events (times at which individuals die) - that's what the flexsurv package (which you loaded but as far as I can see didn't use) would expect.

Unfortunately fitdistrplus::fitdist won't work for your data either - that would expect a distribution of survival times. Furthermore, even if you did have data on independent survival times, your data are censored (only 12% of the individuals have died/failed by the end of the time period); I don't know if fitdist allows for censoring or not.

You probably won't be able to make very strong statistical conclusions about differences among curves, because you don't know (or at least you haven't said) how many independent trials are actually represented by this survival curve - e.g. was the initial cohort made up of 10, 100, or 10^6 individuals ... ?

However, you can fit curves as following:

dat <- data.frame(month = 0:52, 
  survival = c(1, 1, 1, 1, 1, 1, 0.999, 0.998, 
  0.997, 0.993, 0.984, 0.976, 0.973, 0.971, 0.969, 0.969, 0.969, 0.969, 0.968, 
  0.968, 0.968, 0.968, 0.968, 0.968, 
  0.967, 0.966, 0.966, 0.962, 0.957, 0.952, 0.948, 0.944, 
  0.942, 0.941, 0.941, 0.941, 0.941, 0.940, 0.939, 0.938, 
  0.938, 0.938, 0.935, 0.934, 0.930, 0.920, 0.910, 0.895, 
  0.884, 0.881, 0.879, 0.878, 0.878))

Fit by nonlinear least-squares (not a great statistical model, but adequate). Also: need good starting values.

n1 <- nls(survival~pweibull(month,exp(logshape),exp(logscale),
                      lower.tail=FALSE),
    start=list(logshape=0,logscale=log(20)),data=dat)
n2 <- nls(pmin(survival,0.999)~plogis(month,location,exp(logscale),
                            lower.tail=FALSE),
          start=list(location=40,logscale=log(20)),data=dat)

Plot outcomes:

par(bty="l",las=1)
plot(survival~month,data=dat,type="l")
lines(dat$month,predict(n1),col="red")
lines(dat$month,predict(n2),col="blue")