Search code examples
rsegmentation-faultstack-overflowcox-regression

Segmentation faults in R's Cox regression with "exact" ties


I'm trying to fit a large discrete proportional-hazards model (~100k rows, ~10k events). To do this I used coxph(..., method = "exact") as recommended by the survival package documentation documentation, which states:

The “exact partial likelihood” is equivalent to a conditional logistic model, and is appropriate when the times are a small set of discrete values. If there are a large number of ties and (start, stop) style survival data the computational time will be excessive.

There were some warnings about computational difficulty with coxph and large numbers of ties, but according to the documentation for clogit in the same package:

The computation of the exact partial likelihood can be very slow, however. If a particular strata had, say 10 events out of 20 subjects we have to add up a denominator that involves all possible ways of choosing 10 out of 20, which is 20!/(10! 10!) = 184756 terms. Gail et al describe a fast recursion method which largely ameleorates this; it was incorporated into version 2.36-11 of the survival package.

So I didn't expect the computational issues to be too bad. Nevertheless, I've run into many segmentation faults when trying to fit variants of a trivial (one-predictor) Cox model on my dataset. One is a "C stack overflow," resulting in the short and sweet (and uninformative) message:

Error: segfault from C stack overflow
Execution halted

The other is a "memory not mapped" error, which occurred when I accidentally flipped the "event" boolean so that I had ~90k events instead of ~10k:

 *** caught segfault ***
address 0xffffffffac577830, cause 'memory not mapped'

Traceback:
 1: fitter(X, Y, strats, offset, init, control, weights = weights,     method = method, row.names(mf))
 2: coxph(Surv(time, status == EVENT.STATUS) ~ litter, data = data,     method = "exact")
aborting ...

For reference, the code I'm running is simply coxph(Surv(t, d) ~ x, data = data, method = 'exact'). t is an integer column, d is Boolean and x is a float.

Are these known issues? Are there workarounds?

EDIT: Here's some code reproducing the problem on the rats dataset (replicated 1000 times):

library(survival)
print("constructing data")
data <- rats
SIZE <- nrow(rats)
# passes with 100 reps, but fails with 100 on my machine (MacBook Pro, 16g RAM)
REPS <- 1000
# set to 0 for "memory not mapped", 1 for "C stack overflow"
EVENT.STATUS <- 0
data <- data[rep(seq_len(SIZE), REPS), ]
print(summary(data$status == EVENT.STATUS))
print("fitting model")
fit <- coxph(Surv(time, status == EVENT.STATUS) ~ litter,
             data = data, method = "exact")

And here's version:

platform       x86_64-apple-darwin14.0.0
arch           x86_64
os             darwin14.0.0
system         x86_64, darwin14.0.0
status
major          3
minor          1.2
year           2014
month          10
day            31
svn rev        66913
language       R
version.string R version 3.1.2 (2014-10-31)
nickname       Pumpkin Helmet

Solution

  • I am able to make a Poisson model with that dataset. (I've got a large dataset that I'm unwilling to risk a probable segfault on.)

    fit <- glm(  I(status == 0) ~ litter +offset(log(time)), 
                   data = data, family=poisson)
    
    > fit
    
    Call:  glm(formula = I(status == 0) ~ litter + offset(log(time)), family = poisson, 
        data = data)
    
    Coefficients:
    (Intercept)       litter  
      -4.706485    -0.003883  
    
    Degrees of Freedom: 149999 Total (i.e. Null);  149998 Residual
    Null Deviance:      60500 
    Residual Deviance: 60150    AIC: 280200
    

    This estimate of the effect of litter should be similar to what you would get from a Cox PH model.

    If you want to see the "offset trick" documented, go to Breslow and Day's classic monograph: "Statistical Methods in Cancer Research; Vol II- The Design and Analysis of Cohort Studies". They used the GLIM software package, but the code is very similar to R's glm implementation, so the transport of concepts should be straightforward. (I had the opportunity to work briefly with Norm Breslow on my Masters thesis using GLIM. He was brilliant. I think it was my prior training with GLIM that made picking up R so easy.)