Search code examples
rcross-validationsurvival-analysiscox-regression

Is there a way to get the partial likelihood of a Cox PH Model with new data and fixed coefficients?


I'm performing a cross validation on a competing risks proportional hazards model. With help from the mstate pacakge, I've prepared my data and am fitting it with survival::coxph. I get a fitted Cox model object for my training data, but I want to evaluate the partial likelihood of my trained coefficients with my test data.

If I need to, I'll write the partial likelihood function myself, but I'd rather not (though it would probably be good for me). The survival package calculates in this C code, but the likelihood calculation is embedded in the fitting function. Maybe there's a way to fix parameters, or some other tools to easily get at the partial likelihood?

Minimum Working Exmaple

# Adapted from examples in the mstate vignette
# http://cran.r-project.org/web/packages/mstate/vignettes/Tutorial.pdf
# beginning at the bottom of page 28

library(mstate)
library(survival)

# Get data. I add a second explanatory variable (badx) for illustration
# Also divide the data by subject into training and test sets.
data(aidssi)
si <- aidssi # Just a shorter name
si$badx <- sample(c("A", "B"), size = nrow(si), replace = TRUE)
si$fold <- sample(c("train", "test"), size = nrow(si), replace = TRUE, prob = c(0.7, 0.3))
tmat <- trans.comprisk(2, names = c("event-free", "AIDS", "SI"))
si$stat1 <- as.numeric(si$status == 1)
si$stat2 <- as.numeric(si$status == 2)

# Convert the data to a long competing risks format
silong <- msprep(time = c(NA, "time", "time"), 
                 status = c(NA,"stat1", "stat2"),
                 data = si, keep = c("ccr5", "badx", "fold"), trans = tmat)
silong <- na.omit(silong)
silong <- expand.covs(silong, c("ccr5", "badx"))
train.dat <- subset(silong, fold == "train")
test.dat <- subset(silong, fold == "test")

Data looks like this:

> head(silong)
An object of class 'msdata'

Data:
  id from to trans Tstart  Tstop   time status ccr5 badx  fold ccr5WM.1 ccr5WM.2 badxB.1 badxB.2
1  1    1  2     1      0  9.106  9.106      1   WW    A train        0        0       0       0
2  1    1  3     2      0  9.106  9.106      0   WW    A train        0        0       0       0
3  2    1  2     1      0 11.039 11.039      0   WM    B train        1        0       1       0
4  2    1  3     2      0 11.039 11.039      0   WM    B train        0        1       0       1
5  3    1  2     1      0  2.234  2.234      1   WW    B train        0        0       1       0
6  3    1  3     2      0  2.234  2.234      0   WW    B train        0        0       0       1

Now, the ccr5 variable could be modeled as transition-specific, or as a having equal proportional effect for all transitions. The models are:

train.mod.equal <- coxph(Surv(time, status) ~ ccr5 + badx + strata(trans),
                         data = train.dat)
train.mod.specific <- coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + badx + strata(trans),
                            data = train.dat)

Now I would like to use the test data to evaluate the variable selection on whether or not ccr5 should be transition-specific or not. I have a large data set and many variables--mostly but not all categorical--that could go either way. The evaluation is where I'm stuck.

# We can fit the same models to the test data,
# this yields new parameter estimates of course,
# but the model matrices might be useful
test.mod.equal <- coxph(Surv(time, status) ~ ccr5 + badx + strata(trans),
                         data = test.dat)
test.mod.specific <- coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + badx + strata(trans),
                            data = test.dat)
test.eq.mm <- model.matrix(test.mod.equal)
test.sp.mm <- model.matrix(test.mod.specific)

# We can use these to get the first part of the sum of the partial likelihood:
xbeta.eq <- test.eq.mm[test.dat$status == 1, ] %*% coef(train.mod.equal)
xbeta.sp <- test.sp.mm[test.dat$status == 1, ] %*% coef(train.mod.specific)

# We can also get linear predictors
lp.eq <- predict(train.mod.equal, newdata = test.dat, type = "lp")
lp.sp <- predict(train.mod.specific, newdata = test.dat, type = "lp")

I'm hoping to calculate the partial likelihood for each of the models on the test data with the training coefficient estimates. Maybe I should move the question to Cross Validated and ask if the sum of the linear predictors (or the sum of the linear predictors excluding censored cases) is close enough to an equivalent measure.


Solution

  • This is what I was proposing when I wrote: 'Can you calculate a "neo-model" (using the [new data] with a formula that includes an offset [built with] beta estimates [from the original fit] and then use summary(mdl) to do the heavy lifting for you? You might even be able to calculate the offset with predict.coxph.' Turns out I don't need to use summary.coxph since print.coxph gives the LLR statistic.

     lp.eq <- predict(train.mod.equal, newdata = test.dat, type = "lp")
     eq.test.mod <- coxph(Surv(time, status) ~ ccr5 + badx + strata(trans)+offset(lp.eq), 
       data=test.dat )
    eq.test.mod
    
    Call:
    coxph(formula = Surv(time, status) ~ ccr5 + badx + strata(trans) + 
        offset(lp.eq), data = test.dat)
    
    
               coef exp(coef) se(coef)       z    p
    ccr5WM -0.20841     0.812    0.323 -0.6459 0.52
    badxB  -0.00829     0.992    0.235 -0.0354 0.97
    
    Likelihood ratio test=0.44  on 2 df, p=0.804  n= 212, number of events= 74 
    

    I would interpret this to mean that a similar model, fit with the predictions based on the first model but with new data, was not significantly different (than a null model) and that on a log-likelihood scale, it was 0.44 "away" from an exact fit.

    As pointed out by @Gregor, one can access the 'loglik' node of the coxph-object, but I would advise against attaching too much meaning to the single values. To get he LRT statistic one could produce:

    > diff(eq.test.mod$loglik)
    [1] 0.399137
    

    For interest sake, also look at the result without the offset:

    > coxph(Surv(time, status) ~ ccr5 + badx + strata(trans), 
    +       data=test.dat)
    Call:
    coxph(formula = Surv(time, status) ~ ccr5 + badx + strata(trans), 
        data = test.dat)
    
    
              coef exp(coef) se(coef)      z      p
    ccr5WM -0.8618     0.422    0.323 -2.671 0.0076
    badxB  -0.0589     0.943    0.235 -0.251 0.8000
    
    Likelihood ratio test=8.42  on 2 df, p=0.0148  n= 212, number of events= 74 
    

    And you do get the expected result when testing against the original data:

    > lp.eq2 <- predict(train.mod.equal, newdata = train.dat, type = "lp")
    > coxph(Surv(time, status) ~ ccr5 + badx + strata(trans)+offset(lp.eq2), 
    +       data=train.dat)
    Call:
    coxph(formula = Surv(time, status) ~ ccr5 + badx + strata(trans) + 
        offset(lp.eq2), data = train.dat)
    
    
                coef exp(coef) se(coef)         z p
    ccr5WM -4.67e-12         1    0.230 -2.03e-11 1
    badxB   2.57e-14         1    0.168  1.53e-13 1
    
    Likelihood ratio test=0  on 2 df, p=1  n= 436, number of events= 146