As the title says, fisher.test crashes R with a *** caught segfault *** error. Here is the code that reproduces it:
d<-matrix(c(1,0,5,2,1,90,0,0,0,1,0,14,0,0,0,0,0,5,0,
0,0,0,0,2,0,0,0,0,0,2,2,1,0,2,3,89),
nrow=6,byrow = TRUE)
fisher.test(d,simulate.p.value=FALSE)
I ran into this because I call fisher.test inside some functions; running them on my data crashed R with the error above.
I understand that the table passed to fisher.test is ill-behaved, but a crash like this should not happen, I guess.
I would appreciate any suggestions on which conditions the contingency table should meet to avoid this kind of crash in fisher.test. Also, which other arguments should be set in fisher.test to avoid the crash? I did a small test in which fisher.test(d,simulate.p.value=TRUE) did not crash and produced a result.
I am asking because I will have to implement such a check to avoid future crashes in my pipeline.
I can confirm that this is a bug in R 4.2 and that it is now fixed in the development branch of R (with this commit on 7 May). I wouldn't be surprised if it were ported to a patch-release sometime soon, but that's unknown/up to the R developers. Running your example above doesn't segfault any more, but it does throw an error:
Error in fisher.test(d, simulate.p.value = FALSE) : FEXACT[f3xact()] error: hash key 5e+09 > INT_MAX, kyy=203, it[i (= nco = 6)]= 0.
Rather set 'simulate.p.value=TRUE'
So this makes your workflow better (you can handle these errors with try()/tryCatch()), but it doesn't necessarily satisfy you if you really want to perform an exact Fisher test on these data. (Exact tests on large tables with large entries are extremely computationally difficult, as they essentially have to do computations over the set of all possible tables with given marginal values.)
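As a minimal sketch of the try()/tryCatch() approach: the wrapper below attempts the exact test and falls back to the simulated p-value if fisher.test signals an error. The name safe_fisher and the default B are my own choices for illustration. Note that this only helps on R versions where the failure is an ordinary R error; a segfault kills the R process and cannot be caught by tryCatch().

```r
# Hypothetical helper (not part of base R): try the exact test first,
# and fall back to a Monte Carlo p-value if the exact computation errors out.
safe_fisher <- function(tab, B = 10000) {
  tryCatch(
    fisher.test(tab, simulate.p.value = FALSE),
    error = function(e) {
      message("Exact test failed: ", conditionMessage(e),
              "\nFalling back to simulate.p.value = TRUE")
      fisher.test(tab, simulate.p.value = TRUE, B = B)
    }
  )
}

# Usage on a small table, where the exact test succeeds directly
small <- matrix(c(3, 1, 1, 3), nrow = 2)
res_small <- safe_fisher(small)
res_small$p.value
```

If you need to know afterwards which path was taken, the method element of the returned object mentions the simulation when the fallback was used.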
I don't have any brilliant ideas for detecting the exact conditions that will cause this problem (maybe you can come up with a rough rubric based on the dimensions of the table and the sum of the counts in the table, e.g. if (prod(dim(d)) > 30 && sum(d) > 200) ... ?).
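To make that rubric concrete, here is a rough sketch that switches to simulation up front when the table looks too large for the exact algorithm. The thresholds (30 cells, total count 200) are just the illustrative numbers from above, not values derived from the FEXACT code, and fisher_auto is a made-up name; you would need to tune the cutoffs for your own data.

```r
# Hypothetical helper: choose exact vs. simulated p-value from a crude size rule.
# max_cells and max_total are guesses, not documented limits of fisher.test.
fisher_auto <- function(tab, max_cells = 30, max_total = 200, B = 10000) {
  too_big <- prod(dim(tab)) > max_cells && sum(tab) > max_total
  fisher.test(tab, simulate.p.value = too_big, B = B)
}

# The table from the question: 36 cells, total count 220, so the rule
# selects the simulated p-value and the exact code path is never entered.
d <- matrix(c(1,0,5,2,1,90,0,0,0,1,0,14,0,0,0,0,0,5,0,
              0,0,0,0,2,0,0,0,0,0,2,2,1,0,2,3,89),
            nrow = 6, byrow = TRUE)
set.seed(1)  # simulated p-values vary from run to run without a seed
res_d <- fisher_auto(d)
res_d$p.value
```

Because the rule is decided before fisher.test is called, this also protects R versions where the exact computation would segfault rather than throw an error, as long as the thresholds are conservative enough.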
Setting simulate.p.value=TRUE is the most sensible approach. However, if you expect precise results for extreme tables (e.g. you are working in bioinformatics and are going to apply a huge multiple-comparisons correction to the results), you're going to be disappointed. For example:
dd <- matrix(0, 6, 6)
dd[5,5] <- dd[6,6] <- 100
fisher.test(dd)$p.value
## 2.208761e-59, reported as "< 2.2e-16"
fisher.test(dd, simulate.p.value = TRUE, B = 10000)$p.value
# 9.999e-05
fisher.test(..., simulate.p.value = TRUE) will never return a value smaller than 1/(B+1) (this is what happens if none of the simulated tables are more extreme than the observed table: technically, the p-value ought to be reported as "<= 9.999e-05"). Therefore, you will never (in the lifetime of the universe) be able to calculate a p-value like 1e-59; you'll just be able to set a bound based on how large you're willing to make B.
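To get a feel for the 1/(B+1) floor, the following sketch computes the smallest p-value a simulation can report for a given B, and the B you would need to reach a given target. The function names are mine, and required_B is just the floor formula inverted, so treat it as back-of-the-envelope arithmetic.

```r
# Smallest p-value that simulate.p.value = TRUE can report with B replicates
min_sim_p <- function(B) 1 / (B + 1)

# Number of replicates needed so that the floor is at or below p_target
# (obtained by solving 1 / (B + 1) <= p_target for B)
required_B <- function(p_target) ceiling(1 / p_target) - 1

floor_default <- min_sim_p(10000)   # floor for the B = 10000 example above
B_for_1e6     <- required_B(1e-6)   # replicates needed for a floor of 1e-6
B_for_1e59    <- required_B(1e-59)  # on the order of 1e59: not feasible
```

So if your multiple-comparisons threshold is, say, 1e-6, you need roughly a million replicates per test just to be able to fall below it.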