Search code examples
rsimulationweibullsurvival

simulation of a weibull duration model


I want to create an event variable that follows Weibull distribution. The important thing is that the variable should be a combination of a few other observed variables.

Eg: Death is the time to event variable follow Weibull distribution which I want to simulate (here my time scale is age). I already have (simulated) variables such as age, sex, BMI and 4 stages of cancer(categorical variable with 4 categories), So using these 4 variables I want to simulate the time to event death variable.

Let me know if there is any need for clarification


Solution

  • If I am not mistaken you are interested in an accelerated failure time (AFT) Weibull model.

    The survival function is:

    S(t) = exp(- lambda t^p)

    with lambda and p being the scale and shape parameter. The objective is to parametrize lambda. If you solve for t, and assume a fixed probability S(t) = q will get

    t = A*B

    where A = (− log(q))^ 1/p and B = (1/lambda)^(1/p)

    For a binary treatment indicator TREAT, parametrize lambda: B = exp(beta_0 + beta_1*TREAT). The acceleration factor is exp(beta_1) (you can see this by taking the ratio of the A*B expression for a treatment variable relative to a control variable).

    You can simulate your data following the AB expression above, being careful with the coefficients, the random component, and the fixed probability component. In particular, if you use a normal distribution, extreme values can result in negative time, which makes no sense. Time needs to be non-negative.

    set.seed(123)
    library(data.table)
    library(survival)
    
    # generate data
    # (can use base r or dplyr if not familiar with data.table)
    n <- 2000
    d <- data.table(id=1:n,
                    age = runif(n,40,80),
                    male = rbinom(n,1,0.5),
                    bmi = runif(n,15,30),
                    cancer = sample(letters[1:4], n, replace = T), # cancer stages
                    e = runif(n, 0,2) ) # some error, uniform for instance
    
    # you will need to transform the cancer variable into numeric,
    # one category will be the comparison group
    d[, cancer_a := ifelse(cancer=="a", 1, 0)]
    d[, cancer_b := ifelse(cancer=="b", 1, 0)]
    d[, cancer_c := ifelse(cancer=="c", 1, 0)]
    
    # add S(t)
    shape <- 1
    d[, s_tcomp := (-log(0.01))^(1/shape) ]
    
    # generate the time
    d[, time := s_tcomp*exp( -0.001*age - 0.1*male + 0.1*bmi + 0.3*cancer_a + 0.2*cancer_b + 0.1*cancer_c + e)]
    
    
    #' In case you want to add censoring:
    #' we measure time only up to a certain period,
    #' if didnt die so far then still alive
    censor <- quantile(d[,time], 0.9)
    d[, dead := ifelse(time<censor, 1, 0) ]
    d[, time := pmin(time, censor) ]
    
    m <- survreg( Surv(time, dead) ~ age + male + bmi + cancer_a + cancer_b + cancer_c,
             data=d, dist = "weibull",  )
    
    summary(m)
    
    Call:
    survreg(formula = Surv(time, dead) ~ age + male + bmi + cancer_a + 
        cancer_b + cancer_c, data = d, dist = "weibull")
                    Value Std. Error      z       p
    (Intercept)  2.791584   0.098517  28.34 < 2e-16
    age         -0.000943   0.001091  -0.86  0.3874
    male        -0.058586   0.024720  -2.37  0.0178
    bmi          0.099430   0.003071  32.37 < 2e-16
    cancer_a     0.297261   0.034977   8.50 < 2e-16
    cancer_b     0.177142   0.034474   5.14 2.8e-07
    cancer_c     0.101467   0.034039   2.98  0.0029
    Log(scale)  -0.650129   0.018555 -35.04 < 2e-16
    
    Scale= 0.522 
    
    Weibull distribution
    Loglik(model)= -10272.5   Loglik(intercept only)= -10751
        Chisq= 956.93 on 6 degrees of freedom, p= 1.8e-203 
    Number of Newton-Raphson Iterations: 5 
    n= 2000
    

    See also:

    https://cran.r-project.org/web/packages/coxed/vignettes/simulating_survival_data.html

    https://cran.r-project.org/web/packages/simsurv/vignettes/simsurv_usage.html

    https://www.ms.uky.edu/~mai/Rsurv.pdf