My application depends on ghostscript to turn PDF files into a series of images, one per page of each document. This is a simplified version:
import locale
import os
from ghostscript import Ghostscript as gs
from ghostscript import cleanup
from cv2 import imread, IMREAD_GRAYSCALE as GRAY
from multiprocessing import cpu_count

args = [
    "",
    "-q", "-r300", "-dNOPAUSE",
    "-sDEVICE=pgmraw",
    "-sOutputFile=%d.pgm",
    "-dNumRenderingThreads=" + str(cpu_count()),
    "-f", "_.pdf"  # filename will always be "_.pdf"
]
encoding = locale.getpreferredencoding()
args = [a.encode(encoding) for a in args]

def pdftoimarray():
    cleanup()
    gs(*args)  # this is the call that segfaults on bad documents
    imarray = []
    for filename in os.listdir():
        # only read the rendered pages, not "_.pdf" itself
        if filename.endswith(".pgm"):
            imarray.append(imread(filename, GRAY))
    return imarray
(I removed the cleanup of the filesystem at the end on purpose: It's not really important for this question)
Problem is, I can't really trust the source of these documents, and some of them may be faulty. Running some tests, I discovered that some of these bad documents cause ghostscript to actually segfault, which in turn makes my entire application crash.
Normally, a segfault is a very serious event that we can't really recover from, so I'm skeptical if it is actually possible to trap it. But in my case it shouldn't be really that serious: Assuming my program is still in a valid state, I could just flag that document as bad and move on.
Question: Can I somehow trap this segmentation fault in my dependency, and recover from it?
This has been somewhat asked before in Segmentation Fault Catch, but the only answer is wrong: it suggests trapping the signal with signal.signal, but the documentation clearly says that catching synchronous signals such as SIGSEGV with it makes little sense. The same documentation points to faulthandler, but that can't really trap the signal either; it just provides better error messages when it happens.
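To make the faulthandler point concrete, here is a minimal sketch. It assumes a POSIX system, and the crash is forced on purpose with ctypes.string_at(0) (a NULL pointer read) in a child interpreter so the parent survives; the names CRASHING_CODE and crash_with_faulthandler are made up for this illustration.

```python
import subprocess
import sys

# Child program: read from address 0 through ctypes to force a real segfault.
CRASHING_CODE = "import ctypes; ctypes.string_at(0)"


def crash_with_faulthandler():
    """Run the crashing snippet in a child interpreter with faulthandler on."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", CRASHING_CODE],
        capture_output=True,
        text=True,
    )


result = crash_with_faulthandler()
# Negative on POSIX: the child was still killed by the signal.
print("return code:", result.returncode)
# faulthandler's report, written just before the child died.
print(result.stderr)
```

The child gets a readable traceback on stderr, but it dies all the same; only the parent process, which never ran the faulty code, is left to carry on.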
As for what makes this question unique rather than a duplicate: I'm somewhat less restricted. I don't intend to handle the problem at all; I just want to ignore it and move on. Any pointers on avoiding the segfault in ghostscript in the first place would also be very welcome.
This question is a bit old, but I thought I should share this: I was watching a video about a cool new memory allocator, and on one of the questions from the audience, the author explains that he "Installs a segfault handler", which is very much what I am interested in. I still don't know how he does it exactly, so this doesn't answer my question completely, but it gives me a good place to start researching. I'll post an answer here if I manage to figure this out myself.
Here is the video (the link is at the time he answers the question I'm talking about) https://youtu.be/c1UBJbfR-H0?t=2058
I had a similar problem rendering CAD files via pythonocc. Sometimes the script just segfaulted when opening a file. Really annoying: you had to remove the file manually and restart the batch.
So basically the idea is to start an extra process for the task and check its exit code:
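import multiprocessing as mp
import signal

def do_stuff_that_segfaults(param):
    call_shitty_library(param)

def main(param):
    # args must be a tuple, hence the trailing comma
    p = mp.Process(target=do_stuff_that_segfaults, args=(param,))
    p.start()
    p.join()
    # a child killed by signal N reports exitcode -N (SIGSEGV is 11 on Linux)
    if p.exitcode == -signal.SIGSEGV:  # Segmentation fault
        do_stuff_in_case_of_segfault()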
I've also tried other suggestions, like the Segmentation Fault Catch question you linked to, but to no avail.
I really would have liked to use mp.Pool() to use all cores, but you don't get the exit status from mp.Pool().
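One way to still keep all cores busy while reading each exit code might be to manage plain Process objects by hand. The following is only a sketch under assumptions: crashy_task is a hypothetical stand-in for the crash-prone call (it kills itself with SIGSEGV on odd inputs), run_all is a made-up helper name, and the "fork" start method is chosen so the snippet runs as a plain script on POSIX.

```python
import multiprocessing as mp
import os
import signal
from multiprocessing.connection import wait


def crashy_task(param):
    """Hypothetical stand-in for the crash-prone library call."""
    if param % 2:  # pretend odd inputs are the bad files
        os.kill(os.getpid(), signal.SIGSEGV)


def run_all(params, target=crashy_task, max_workers=None):
    """Run target(param) in one process per item, max_workers at a time.

    Returns a dict mapping each param to the exit code of its process.
    """
    ctx = mp.get_context("fork")  # assumption: POSIX, see the note below
    max_workers = max_workers or os.cpu_count() or 1
    pending = list(params)
    running = {}  # sentinel -> (process, param)
    exitcodes = {}
    while pending or running:
        # Top up to max_workers live processes.
        while pending and len(running) < max_workers:
            param = pending.pop()
            proc = ctx.Process(target=target, args=(param,))
            proc.start()
            running[proc.sentinel] = (proc, param)
        # Block until at least one child has ended, then collect it.
        for sentinel in wait(list(running)):
            proc, param = running.pop(sentinel)
            proc.join()
            exitcodes[param] = proc.exitcode
            proc.close()
    return exitcodes


codes = run_all(range(8))
bad = sorted(p for p, code in codes.items() if code == -signal.SIGSEGV)
print("inputs that segfaulted:", bad)
```

With the "spawn" or "forkserver" start methods the same idea should work, but the call to run_all would then need to sit under an `if __name__ == "__main__":` guard and the target must be importable by the child.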
So far the code runs well, and do_stuff_in_case_of_segfault() moves the files that cause a segfault into another folder without my main script getting killed.
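Applied to the ghostscript case from the question, the same isolation idea could also be done by calling the gs command-line program instead of the Python binding. This is a hedged sketch, not a tested recipe: it assumes a gs executable is on the PATH, the flags mirror the question's argument list with -dBATCH added so gs exits when it is done, and gs_command, run_isolated and CRASH_DEMO are names invented here.

```python
import os
import subprocess
import sys

# Stand-in command that segfaults on purpose, handy for trying run_isolated
# without having a bad PDF at hand.
CRASH_DEMO = [
    sys.executable, "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
]


def gs_command(pdf_path, out_pattern="%d.pgm", gs_exe="gs"):
    """Build the gs command line; gs_exe="gs" is an assumption."""
    return [
        gs_exe, "-q", "-r300", "-dNOPAUSE",
        "-dBATCH",  # added: exit after the last page instead of prompting
        "-sDEVICE=pgmraw",
        "-sOutputFile=" + out_pattern,
        "-dNumRenderingThreads=" + str(os.cpu_count() or 1),
        "-f", pdf_path,
    ]


def run_isolated(cmd, timeout=None):
    """Run cmd in a child process and classify how it ended.

    Returns "ok", "crashed" (killed by a signal such as SIGSEGV),
    "failed" (non-zero exit status) or "timeout".
    """
    try:
        result = subprocess.run(cmd, stdin=subprocess.DEVNULL, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout"
    if result.returncode < 0:  # POSIX: negative means killed by that signal
        return "crashed"
    return "ok" if result.returncode == 0 else "failed"
```

Usage would be along the lines of `status = run_isolated(gs_command("_.pdf"), timeout=120)`, flagging the document as bad for anything other than "ok". A missing gs executable raises FileNotFoundError rather than returning a status.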