I'm trying to fit a large discrete proportional-hazards model (~100k rows, ~10k events). To do this I used coxph(..., method = "exact"), as recommended by the survival package documentation, which states:
The “exact partial likelihood” is equivalent to a conditional logistic model, and is appropriate when the times are a small set of discrete values. If there are a large number of ties and (start, stop) style survival data the computational time will be excessive.
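To see what that equivalence means in practice, here is a minimal sketch of my own (not from the docs), using the infert matched case-control data that ships with base R; the clogit fit and the method = "exact" coxph fit should give the same coefficients:

library(survival)
# clogit is documented as a wrapper around coxph with method = "exact";
# this fits the same matched case-control model both ways on infert.
fit.clogit <- clogit(case ~ spontaneous + induced + strata(stratum),
                     data = infert)
fit.coxph  <- coxph(Surv(rep(1, nrow(infert)), case) ~
                      spontaneous + induced + strata(stratum),
                    data = infert, method = "exact")
# The coefficient vectors should agree (up to convergence tolerance).
cbind(clogit = coef(fit.clogit), coxph = coef(fit.coxph))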
There were some warnings about computational difficulty with coxph and large numbers of ties, but according to the documentation for clogit in the same package:
The computation of the exact partial likelihood can be very slow, however. If a particular strata had, say, 10 events out of 20 subjects we have to add up a denominator that involves all possible ways of choosing 10 out of 20, which is 20!/(10! 10!) = 184756 terms. Gail et al describe a fast recursion method which largely ameliorates this; it was incorporated into version 2.36-11 of the survival package.
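Just to put the combinatorics in perspective (my own illustration, not from the docs), the size of the naive denominator can be computed directly:

# Number of terms in the naive exact-likelihood denominator for one stratum:
choose(20, 10)    # 184756, the example from the clogit documentation
# A hypothetical larger stratum, just to show how quickly this grows
# before the Gail et al recursion is applied:
choose(200, 100)  # about 9e58 terms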
So I didn't expect the computational issues to be too bad. Nevertheless, I've run into many segmentation faults when trying to fit variants of a trivial (one-predictor) Cox model on my dataset. One is a "C stack overflow," resulting in the short and sweet (and uninformative) message:
Error: segfault from C stack overflow
Execution halted
The other is a "memory not mapped" error, which occurred when I accidentally flipped the "event" boolean so that I had ~90k events instead of ~10k:
*** caught segfault ***
address 0xffffffffac577830, cause 'memory not mapped'
Traceback:
1: fitter(X, Y, strats, offset, init, control, weights = weights, method = method, row.names(mf))
2: coxph(Surv(time, status == EVENT.STATUS) ~ litter, data = data, method = "exact")
aborting ...
For reference, the code I'm running is simply coxph(Surv(t, d) ~ x, data = data, method = 'exact'). t is an integer column, d is Boolean, and x is a float.
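For concreteness, here is a tiny synthetic stand-in with the same column types (the names t, d, and x mirror the call above; the data itself is made up):

library(survival)
set.seed(1)
n <- 1000
fake <- data.frame(t = sample.int(50, n, replace = TRUE),  # integer times
                   d = runif(n) < 0.1,                     # Boolean event flag
                   x = rnorm(n))                           # float covariate
fit <- coxph(Surv(t, d) ~ x, data = fake, method = "exact")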
Are these known issues? Are there workarounds?
EDIT: Here's some code reproducing the problem on the rats dataset (replicated 1000 times):
library(survival)
print("constructing data")
data <- rats
SIZE <- nrow(rats)
# passes with 100 reps, but fails with 1000 on my machine (MacBook Pro, 16 GB RAM)
REPS <- 1000
# set to 0 for "memory not mapped", 1 for "C stack overflow"
EVENT.STATUS <- 0
data <- data[rep(seq_len(SIZE), REPS), ]
print(summary(data$status == EVENT.STATUS))
print("fitting model")
fit <- coxph(Surv(time, status == EVENT.STATUS) ~ litter,
data = data, method = "exact")
And here's the output of version:
platform x86_64-apple-darwin14.0.0
arch x86_64
os darwin14.0.0
system x86_64, darwin14.0.0
status
major 3
minor 1.2
year 2014
month 10
day 31
svn rev 66913
language R
version.string R version 3.1.2 (2014-10-31)
nickname Pumpkin Helmet
I am able to fit a Poisson model with that dataset. (It's a large dataset, and I'm unwilling to risk a probable segfault on it.)
fit <- glm(I(status == 0) ~ litter + offset(log(time)),
           data = data, family = poisson)
> fit
Call: glm(formula = I(status == 0) ~ litter + offset(log(time)), family = poisson,
data = data)
Coefficients:
(Intercept) litter
-4.706485 -0.003883
Degrees of Freedom: 149999 Total (i.e. Null); 149998 Residual
Null Deviance: 60500
Residual Deviance: 60150 AIC: 280200
This estimate of the effect of litter should be similar to what you would get from a Cox PH model.
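One way to check that claim (a sketch of my own, on the un-replicated rats data, using coxph's default Efron approximation for ties so the fit stays cheap) is to fit both models and compare the litter coefficients:

library(survival)
# Poisson "offset trick" fit vs. an ordinary Cox fit on the original rats data.
pois.fit <- glm(I(status == 0) ~ litter + offset(log(time)),
                data = rats, family = poisson)
cox.fit  <- coxph(Surv(time, status == 0) ~ litter, data = rats)
# Compare the estimated effect of litter from the two models.
cbind(poisson = coef(pois.fit)["litter"],
      cox     = coef(cox.fit)["litter"])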
If you want to see the "offset trick" documented, go to Breslow and Day's classic monograph, Statistical Methods in Cancer Research, Vol. II: The Design and Analysis of Cohort Studies. They used the GLIM software package, but the code is very similar to R's glm implementation, so the transport of concepts should be straightforward. (I had the opportunity to work briefly with Norm Breslow on my Masters thesis using GLIM. He was brilliant. I think it was my prior training with GLIM that made picking up R so easy.)