I'm writing a C application which is run across a compute cluster (using condor). I've tried many methods to reveal the offending code but to no avail.
Clues:
Attempts:
Sadly the line number outputted is fairly random. I'm not entirely sure what I can do with the stacktrace. Am I correct in assuming that it only records the address of the function in which the segfault occurs?
Suspicions:
UPDATE
Researching this some more I've found the following links:
LibSegFault - a library for automatically catching and printing state data about segfaults.
Stack unwinding (stack trace) with GCC tutorial on catching segfaults and get the line numbers of the offending instructions.
UPDATE 2
Greg suggested looking at the condor log and to 'correlate the segfaults to when condor restarts the executable from a checkpoint'. Looking at the logs the segfaults all occur immediately after a restart. All of the failures appear to occur when a job switches from one type of machine to another type.
UPDATE 3
The segfault was being caused by differences between hosts, by setting the 'requiremets' field in the condor submit file to problem completely disappeared.
One can set individual machines:
requirements = machine == "hostname1" || machine == "hostname2"
or an entire class of machines:
requirements = classOfMachinesName
See requirements example here
if you can, compile with debugging, and run under gdb. alternatively, get core dumped and load that into debugger.
mpich has built-in debugger, or you can buy commercial parallel debugger.
Then you can step through the code to see what happening in debugger