r predict survival-analysis cox-regression cox

Coxph predictions don't match the coefficients

Good afternoon,

I could post reproducible code and certainly will if everyone agrees that something is wrong, but right now I think my question is quite simple and someone will point me the right path.

I am working in a data set like this:

created_as_free_user     t     c
                 <fctr> <int> <int>
1                  true    36     0
2                  true    36     0
3                  true     0     1
4                  true    28     0
5                  true     9     0
6                  true     0     1
7                  true    13     0
8                  true    19     0
9                  true     9     0
10                 true    16     0

I fitted a Cox Regression model like this:

fit_train = coxph(Surv(time = t,event = c) ~ created_as_free_user ,data = teste)
summary(fit_train)

And received:

Call:
coxph(formula = Surv(time = t, event = c) ~ created_as_free_user, 
    data = teste)

  n= 9000, number of events= 1233 

                            coef exp(coef) se(coef)      z Pr(>|z|)    
created_as_free_usertrue -0.7205    0.4865   0.1628 -4.426 9.59e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                         exp(coef) exp(-coef) lower .95 upper .95
created_as_free_usertrue    0.4865      2.055    0.3536    0.6693

Concordance= 0.511  (se = 0.002 )
Rsquare= 0.002   (max possible= 0.908 )
Likelihood ratio test= 15.81  on 1 df,   p=7e-05
Wald test            = 19.59  on 1 df,   p=9.589e-06
Score (logrank) test = 20.45  on 1 df,   p=6.109e-06

So far so good. Next step: Predict the results on new data. I understand the different types of predictions that predict.coxph can give me (or at least I think I do). Let's use the type = "lp":

head(predict(fit_train,validacao,type = "lp"),n=20)

And get:

     1           2           3           4           5           6           7           8           9          10 
-0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 
         11          12          13          14          15          16          17          18          19          20 
-0.01208854 -0.01208854  0.70842049 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854 -0.01208854

OK. But when I look at the data that I am trying to estimate:

# A tibble: 9,000 × 3
   created_as_free_user     t     c
                 <fctr> <int> <int>
1                  true    20     0
2                  true    12     0
3                  true     0     1
4                  true    10     0
5                  true    51     0
6                  true    36     0
7                  true    44     0
8                  true     0     1
9                  true    27     0
10                 true     6     0
# ... with 8,990 more rows

It makes me confuse....

The type = "lp" isn't suppose to give you the linear predictions? For this data above that I am trying to estimate, since the created_as_free_user variable is equal to true, Am I wrong expecting the type = "lp" prediction to be exactaly -0.7205 (the coefficient of the model above)? Where does the -0.01208854 came? I suspect it's some sort of scale situation, but couldn't find the answer online.

My final objective is the h(t) that is given by the prediction type = "expected", but I am not all that comfortable using it because it uses this -0.01208854 value that I don't fully understand.

Thanks a lot

Solution

The Details section in ?predict.coxph reads:

The Cox model is a relative risk model; predictions of type "linear predictor", "risk", and "terms" are all relative to the sample from which they came. By default, the reference value for each of these is the mean covariate within strata.

To illustrate what this means, we can look at a simple example. Some fake data:

test1 <- list(time=c(4,3,1,1,1), 
             status=c(1,1,1,0,0), 
             x=c(0,2,1,1,0))

We fit a model and view predictions:

fit <- coxph(Surv(time, status) ~ x, test1) 
predict(fit, type = "lp")
# [1] -0.6976630  1.0464945  0.1744157  0.1744157 -0.6976630

The predictions are the same as:

(test1$x - mean(test1$x)) * coef(fit)
# [1] -0.6976630  1.0464945  0.1744157  0.1744157 -0.6976630

(Using this logic and some arithmetic we can back out from your results that you have 8849 "trues" out of 9000 observations for your created_as_free_user variable.)