Search code examples
rsurvival-analysissurvival

Nan and Inf in survreg model summary


I'm using survival analysis to evaluate the relative distance (instead of time, as it's usually the case with survival statistics) before a given event happened. As the dataset I'm working with is quite big, you can download the .rds file of my dataset here

When modeling the relative distance using survreg(), I encountered NaN and Inf z and p values (presumably deriving from 0 values of Std Error) in the model summary:

Call:
survreg(formula = Surv(RelDistance, Status) ~ Backshore + LowerBSize + 
    I(LowerBSize^2) + I(LowerBSize^3) + State, data = DataLong, 
    dist = "exponential")
                            Value Std. Error        z         p
(Intercept)                   2.65469   1.16e-01  22.9212 2.85e-116
BackshoreDune                -0.08647   9.21e-02  -0.9387  3.48e-01
BackshoreForest / Tree (>3m) -0.07017   0.00e+00     -Inf  0.00e+00
BackshoreGrass - pasture     -0.79275   1.63e-01  -4.8588  1.18e-06
BackshoreGrass - tussock     -0.14687   1.00e-01  -1.4651  1.43e-01
BackshoreMangrove             0.08207   0.00e+00      Inf  0.00e+00
BackshoreSeawall             -0.47019   1.43e-01  -3.2889  1.01e-03
BackshoreShrub (<3m)         -0.14004   9.45e-02  -1.4815  1.38e-01
BackshoreUrban / Building     0.00000   0.00e+00      NaN       NaN
LowerBSize                   -0.96034   1.96e-02 -49.0700  0.00e+00
I(LowerBSize^2)               0.06403   1.87e-03  34.2782 1.66e-257
I(LowerBSize^3)              -0.00111   3.84e-05 -28.8070 1.75e-182
StateNT                       0.14936   0.00e+00      Inf  0.00e+00
StateQLD                      0.09533   1.02e-01   0.9384  3.48e-01
StateSA                       0.01030   1.06e-01   0.0973  9.22e-01
StateTAS                      0.19096   1.26e-01   1.5171  1.29e-01
StateVIC                     -0.07467   1.26e-01  -0.5917  5.54e-01
StateWA                       0.24800   9.07e-02   2.7335  6.27e-03

Scale fixed at 1 

Exponential distribution
Loglik(model)= -1423.4   Loglik(intercept only)= -3282.8
    Chisq= 3718.86 on 17 degrees of freedom, p= 0 
Number of Newton-Raphson Iterations: 6 
n= 6350 

I thought the Inf and NaN could be caused by data separation, and merged some levels of Backshore together:

levels(DataLong$Backshore)[levels(DataLong$Backshore)%in%c("Grass - 
pasture", "Grass - tussock", "Shrub (<3m)")] <- "Grass - pasture & tussock 
/ Shrub(<3m)"
levels(DataLong$Backshore)[levels(DataLong$Backshore)%in%c("Seawall", 
"Urban / Building")] <- "Anthropogenic"
levels(DataLong$Backshore)[levels(DataLong$Backshore)%in%c("Forest / Tree 
(>3m)", "Mangrove")] <- "Tree(>3m) / Mangrove"

but the issue persists when running the model again (i.e. Backshore Tree(>3m) / Mangrove).

Call:
survreg(formula = Surv(RelDistance, Status) ~ Backshore + LowerBSize + 
    I(LowerBSize^2) + I(LowerBSize^3) + State, data = DataLong, 
    dist = "exponential")
                                              Value Std. Error       z         p
(Intercept)                                      2.6684   1.18e-01  22.551 1.32e-112
BackshoreDune                                   -0.1323   9.43e-02  -1.402  1.61e-01
BackshoreTree(>3m) / Mangrove                   -0.0530   0.00e+00    -Inf  0.00e+00
BackshoreGrass - pasture & tussock / Shrub(<3m) -0.2273   8.95e-02  -2.540  1.11e-02
BackshoreAnthropogenic                          -0.5732   1.38e-01  -4.156  3.24e-05
LowerBSize                                      -0.9568   1.96e-02 -48.920  0.00e+00
I(LowerBSize^2)                                  0.0639   1.87e-03  34.167 7.53e-256
I(LowerBSize^3)                                 -0.0011   3.84e-05 -28.713 2.59e-181
StateNT                                          0.2892   0.00e+00     Inf  0.00e+00
StateQLD                                         0.0715   1.00e-01   0.713  4.76e-01
StateSA                                          0.0507   1.05e-01   0.482  6.30e-01
StateTAS                                         0.1990   1.26e-01   1.581  1.14e-01
StateVIC                                        -0.0604   1.26e-01  -0.479  6.32e-01
StateWA                                          0.2709   9.05e-02   2.994  2.76e-03

Scale fixed at 1 

Exponential distribution
Loglik(model)= -1428.4   Loglik(intercept only)= -3282.8
    Chisq= 3708.81 on 13 degrees of freedom, p= 0 
Number of Newton-Raphson Iterations: 6 
n= 6350 

I've looked for an explanation for this behaviour pretty much everywhere in the survival package documentation and online, but I couldn't find anything that related to this.

Does anyone know what could be the cause of Inf and NaNs in this case?


Solution

  • @MarcoSandri is correct that censoring is confounded with LowerBSize, but I'm not sure that's the entire solution. It could explain why the model is so unstable, but that by itself shouldn't (AFAICT) make the model ill-posed. If I replace LowerBSize+ I(LowerBSize^2) + I(LowerBSize^3) with an orthogonal polynomial (poly(LowerBSize,3)) I get more reasonable-looking answers:

    ss3 <- survreg(formula = Surv(RelDistance, Status) ~ Backshore +
                       poly(LowerBSize,3) + State, data = DataLong, 
                   dist = "exponential")
    
    Call:
    survreg(formula = Surv(RelDistance, Status) ~ Backshore + poly(LowerBSize, 
        3) + State, data = DataLong, dist = "exponential")
                                     Value Std. Error      z       p
    (Intercept)                   2.18e+00   1.34e-01  16.28 < 2e-16
    BackshoreDune                -1.56e-01   1.06e-01  -1.47 0.14257
    BackshoreForest / Tree (>3m) -2.24e-01   2.01e-01  -1.11 0.26549
    BackshoreGrass - pasture     -8.63e-01   1.74e-01  -4.97 6.7e-07
    BackshoreGrass - tussock     -2.14e-01   1.13e-01  -1.89 0.05829
    BackshoreMangrove             3.66e-01   4.59e-01   0.80 0.42519
    BackshoreSeawall             -5.37e-01   1.53e-01  -3.51 0.00045
    BackshoreShrub (<3m)         -2.08e-01   1.08e-01  -1.92 0.05480
    BackshoreUrban / Building    -1.17e+00   3.22e-01  -3.64 0.00028
    poly(LowerBSize, 3)1         -6.58e+01   1.41e+00 -46.63 < 2e-16
    poly(LowerBSize, 3)2          5.09e+01   1.19e+00  42.72 < 2e-16
    poly(LowerBSize, 3)3         -4.05e+01   1.41e+00 -28.73 < 2e-16
    StateNT                       2.61e-01   1.93e-01   1.35 0.17557
    StateQLD                      9.72e-02   1.12e-01   0.87 0.38452
    StateSA                      -4.11e-04   1.15e-01   0.00 0.99715
    StateTAS                      1.91e-01   1.35e-01   1.42 0.15581
    StateVIC                     -9.55e-02   1.35e-01  -0.71 0.47866
    StateWA                       2.46e-01   1.01e-01   2.44 0.01463
    

    If I fit exactly the same model but with poly(LowerBSize,3,raw=TRUE) (calling the result ss4, see below) I get your pathologies again. Furthermore, the model with orthogonal polynomials actually fits better (has a higher log-likelihood):

    logLik(ss4)
    ## 'log Lik.' -1423.382 (df=18)
    logLik(ss3)
    ## 'log Lik.' -1417.671 (df=18)
    

    In a perfect mathematical/computational world, this shouldn't be true - it's another indication that there's something unstable about specifying the LowerBSize effects this way. I'm a little surprised this happens - the number of unique values of LowerBSize is small but shouldn't be pathological, and the range of values isn't huge or far from zero ...


    I still can't say what's really causing this, but the proximal problem is probably the strong correlation between the linear/quadratic/cubic terms: for something simpler like linear regression a correlation of 0.993 (between quad & cubic terms) doesn't cause severe problems, but the more complicated the numerical problem (e.g. survival analysis vs. linear regression), the more correlation can be an issue ...

    X <- model.matrix( ~ Backshore + LowerBSize + 
                           I(LowerBSize^2) + I(LowerBSize^3) + State,
                      data=DataLong)
    print(cor(X[,grep("LowerBSize",colnames(X))]),digits=3)
    library(corrplot)
    png("survcorr.png")
    corrplot.mixed(cor(X[,-1]),lower="ellipse",upper="number",
                 tl.cex=0.4)
    dev.off()
    

    enter image description here