I'm working on a Python project that calls a Fortran subroutine via f2py for efficiency.
When I execute the code, it fails at seemingly random (inconsistent) points with Segmentation Fault errors. Using Python's faulthandler library, I have narrowed these down to Bus Error and munmap_chunk(): invalid pointer errors, though which error appears, and where, still varies from run to run.
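A minimal sketch of switching faulthandler on before calling into the compiled extension. The commented-out module and call names are placeholders for whatever f2py generates, not the real names from the project.

```python
import faulthandler

# Dump the Python traceback to stderr if the process receives a fatal
# signal such as SIGSEGV, SIGBUS or SIGABRT.
faulthandler.enable()

# Hypothetical call into the f2py-built extension (placeholder names):
# import event_rates_ext
# event_rates_ext.event_rates.event_rate_f95(...)
```

The same effect is available without editing the script by running it with python -X faulthandler.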
Given the seemingly random nature of the error, I'm afraid I can't provide an MWE. My Fortran code is (abridged -- full version here):
    module event_rates
    contains
        subroutine event_rate_f95(events, edges, events_edges, lifetimes, lt_size, NBins)
            implicit none
            ! define input parameters
            ! define internal variables
            dtd = delay_time_distribution(lifetimes, edges, NBins)
            print *, "DTD generated"
            do i = 1, NBins+1
                t1 = events_edges(i-1)
                t2 = events_edges(i)
                print *, "Ts done"
                z1 = estimate_redshift(t1)
                z2 = estimate_redshift(t2)
                print *, 'computing sfr'
                SFR = compute_SFR(z1, z2) / (1E-3) ** 3
                print *, "about to enter inner loop"
                do j = 0, i-1
                    ! do a computation
                enddo
                print *, "exited inner loop"
                print *, i
            enddo
        end subroutine
    end module event_rates
Here delay_time_distribution, estimate_redshift and compute_SFR are functions I define earlier. For reference, NBins is approximately 50 whenever I call this. In the 3 most recent executions, it failed at:
1) i=20, inside estimate_redshift(),
2) inside the delay_time_distribution() function,
3) after the Fortran code had terminated and returned control to Python.
From reading background information on these errors, it appears to be a memory management problem: a Segmentation Fault means accessing memory I'm not allowed to access, a Bus Error means accessing an address that isn't valid, and munmap_chunk() means an invalid pointer was passed to free(). But I'm relying on Fortran 95's inbuilt memory management to handle this for me. Monitoring htop while the code is executing shows that CPU usage on one core spikes, but memory usage stays constant.
My question is two-fold: what is causing these errors, and how can one debug this further in general?
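There is a simple way to debug this: use debug flags.
If you're compiling Fortran with gfortran, you can pass -fcheck=bounds to the gfortran command, which adds runtime checks on array subscripts so that an out-of-bounds access stops the program with an error naming the array and line instead of silently corrupting memory. Similarly, you can pass --opt='-fcheck=bounds' to the f2py3 command to debug the issue.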
In my case, I was indexing an array out of bounds. Consider line 122 of the pastebin:
bin_low = floor(NBins * (x1 - NBins) / (NBins))
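If x1 = 0, then bin_low = -NBins, which is negative because NBins (the number of bins) is positive. For an array declared with Fortran's default lower bound of 1, a negative index is out of bounds: it touches memory outside the array, which is what produces the seg fault. Out-of-bounds writes can also corrupt the heap, which would explain why the crash showed up at different places, including after control had returned to Python.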
The solution here is to constrain the index:
bin_low = max(1, floor(NBins * (x1 - NBins) / (NBins)))
That way, if the formula gives you a zero or negative bin, you access the first bin instead. (Remember, Fortran arrays are 1-indexed by default.)
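To check the arithmetic, here is a small Python re-enactment of the index formula (not the Fortran itself). NBins = 50 matches the value quoted in the question and x1 is assumed to be a real number; the helper names and the extra upper clamp to NBins are assumptions added for illustration only.

```python
import math


def bin_low_unclamped(x1, nbins):
    # Same formula as line 122 of the pastebin, evaluated in Python.
    return math.floor(nbins * (x1 - nbins) / nbins)


def bin_low_clamped(x1, nbins):
    # Lower clamp as in the fix above. The upper clamp to nbins is an
    # extra assumption, guarding the other end of a 1..nbins array.
    return min(nbins, max(1, bin_low_unclamped(x1, nbins)))


nbins = 50
print(bin_low_unclamped(0.0, nbins))  # -50
print(bin_low_clamped(0.0, nbins))    # 1
```

Clamping keeps the index legal, but it is worth confirming that the formula itself maps x1 to the intended bin, since a clamp only hides a wrong value by pinning it to the first bin.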