python survival-analysis cox-regression survival lifelines

lifelines/scikit-survival: Calculation of the expected times

I am trying to understand how to calculate the expected time for each of the ids in my dataset. The data is a DataFrame of shape (500, 4), indexed by ids:

ids var1       var2  churn     time
0   1.738434    324    0       21.0
1   1.541176    12     0       4.0
2   2.049281    753    1       5.0
3   1.929860    563    0       16.0
4   1.595027    22     0       5.0
... ... ... ... ...

Let's use lifelines to calculate the expected value, using either predict_expectation or the median of the survival function for each ID.

Part 1: Calculate the expected values

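from lifelines import CoxPHFitter

# Fit a Cox model: "time" is the duration, "churn" the event indicator (1 = churned)
cph = CoxPHFitter()
cph.fit(data, duration_col="time", event_col="churn")

# Keep only the censored rows (ids that have not churned yet)
censored_df = data[data["churn"] == 0]

cph.predict_expectation(censored_df)  # optionally: conditional_after=censored_df["time"]
# or
cph.predict_median(censored_df)  # optionally: conditional_after=censored_df["time"]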
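
To make the commented-out conditional_after argument concrete: for a censored id that has already survived to time t0, the relevant curve is the conditional one, S(t0 + s) / S(t0), and the prediction is then a remaining time counted from t0 (as far as I understand, that is how lifelines reports it as well). Below is a minimal pure-Python sketch of that idea. The helper names and the survival curve are hypothetical and only for illustration; they are not lifelines functions and not values from my dataset.

```python
from bisect import bisect_right


def survival_at(times, probs, t):
    """Step-function S(t): value at the last grid time <= t (1.0 before the first)."""
    i = bisect_right(times, t)
    return 1.0 if i == 0 else probs[i - 1]


def conditional_median_remaining(times, probs, t0):
    """Median remaining time given survival past t0, using S(t0 + s) / S(t0)."""
    s0 = survival_at(times, probs, t0)
    if s0 <= 0.0:
        return 0.0  # the curve says nobody survives past t0
    for t, p in zip(times, probs):
        if t > t0 and p / s0 <= 0.5:
            return t - t0  # remaining time, counted from t0
    return float("inf")  # conditional curve never drops to 0.5 on this grid


# Hypothetical survival curve for one id (made-up numbers).
times = [2, 4, 6, 8, 10, 12]
probs = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]

unconditional_median = conditional_median_remaining(times, probs, 0)
remaining_after_6 = conditional_median_remaining(times, probs, 6)
```

Without conditioning, the median of this curve can fall before the time an id has already been observed to survive, which is one way a negative diff_median can show up for censored rows.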

With scikit-survival, the same quantity is calculated from predict_survival_function().

Concordance index = 0.82

Part 2: Compare the results with the actuals

So now I have created a table using both methods: predict_expectation() (the "expected" column) and predict_median() (the "median" column).

With scikit-survival it can only be calculated by taking the median (please note that I am aware this might be different for other algorithms in lifelines/scikit-survival, but focus on the idea). The table looks like this:


ids churn time  expected    diff_expectation median diff_median
0   0   21.0    21.526222   0.526222          8.0     -13.0
1   0   4.0     21.819911   17.819911         13.0     9.0
3   0   16.0    23.189344   7.189344          9.0     -7.0
4   0   5.0     22.090598   17.090598         12.0     7.0
6   0   8.0     21.545022   13.545022         10.0     2.0
... ... ... ... ... ... ...

The "diff" columns are the difference between the respective predicted column and "time".

Questions

Why are the expected times so far off?
Is there anything wrong with my approach? Should I predict on the whole data (censored + uncensored) or only on the censored rows? (I have tried all three combinations: only censored, only uncensored, and both, and the predictions are still off.) My understanding is that if the survival curve for an ID converges to 0 (uncensored data) you can use the area under the curve, and if it is censored you need to use the median of the survival curve. (I did the calculation above with that in mind.)
How can I achieve a closer estimate?
If I fit the model only on uncensored data and then predict on that same uncensored data, I should get a very close estimate, right? Well, this is not the case. I would expect the average of the predicted medians to be similar to the median of the actual values, or at least the mean of the "diff" column to be close to 0. Neither holds, which suggests some bias in the model.
Why does predict_expectation output something different from predict_median? Which one is recommended?

This phenomenon happens with any dataset. You can try replicating this example with the load_leukemia dataset (from lifelines.datasets import load_leukemia); even with a concordance index of 0.9, it still happens.

Here are a few resources that I found that sort of explain this, but I don't fully understand them. If someone can break it down a bit more, that would be great.

Sources

You can find a fully coded example here: https://github.com/felipe0216/survival_examples/blob/main/predict_expectation_scikit.py

Solution

This article gives a nice explanation of the differences between expectation and median as ways to predict survival time.

Basically, the expectation is a good prediction only if the survival probability of the data you're dealing with eventually reaches S(t)=0. If it doesn't, the expectation (calculated as the area under the survival curve) will be infinite.

In that case, the median (the time at which the survival probability crosses 0.5) would be more appropriate. However, sometimes the data never reaches S(t)=0.5, and then the median is not defined either.
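
As a rough illustration of the two cases, here is a small pure-Python sketch that computes both summaries from a step survival curve. The function names and the curves are made up for this example (they are not lifelines or scikit-survival APIs). It assumes a right-continuous step function that starts at S(0)=1, and it can only integrate up to the last time on the grid, so the "expectation" is really a truncated (restricted) mean.

```python
def restricted_mean(times, probs):
    """Area under a right-continuous step survival curve on [0, times[-1]]."""
    area, prev_t, prev_p = 0.0, 0.0, 1.0  # S(t) = 1 before the first grid time
    for t, p in zip(times, probs):
        area += prev_p * (t - prev_t)  # rectangle under the previous step
        prev_t, prev_p = t, p
    return area


def median_time(times, probs):
    """First grid time where S(t) <= 0.5, or infinity if that never happens."""
    for t, p in zip(times, probs):
        if p <= 0.5:
            return t
    return float("inf")


# Hypothetical curves: one reaches S(t)=0, the other plateaus at 0.6.
times = [5, 10, 15, 20]
reaches_zero = [0.7, 0.4, 0.1, 0.0]
plateaus = [0.9, 0.8, 0.7, 0.6]

mean_zero = restricted_mean(times, reaches_zero)    # finite and stable
median_zero = median_time(times, reaches_zero)
mean_plateau = restricted_mean(times, plateaus)     # depends on where the grid ends
median_plateau = median_time(times, plateaus)       # never crosses 0.5

# Extending the plateau curve's grid keeps increasing the truncated area.
mean_plateau_longer = restricted_mean(times + [40], plateaus + [0.6])
```

For the curve that reaches zero, the mean and the median are both well defined but are different summaries of a skewed distribution, so there is no reason to expect them to agree. For the plateau curve, the area keeps growing as the time grid is extended and the median is never reached.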

So I think the answer is that it depends.