I have come across a Python sub-process issue that I have replicated on Python 3.6 and 3.7 that I do not understand. I have a program, call it Main, that launches an external process using subprocess.Popen(), call it "Slave". The Main program registers a SIGTERM signal handler. Main waits on the Slave process to complete using either proc.wait(None) or proc.wait(timeout). The Slave process can be interrupted by sending a SIGTERM signal to Main. The sigterm handler will send a SIGINT signal to the Slave and wait(30) for it to terminate. If Main is using wait(None), then the sigterm handler's wait(30) will wait the full 30 seconds even though the slave process has terminated. If Main is using the wait(timeout) version, then the sigterm handler's wait(30) will return as soon as the Slave terminates.
Here is a small test app that demonstrates the issue. Run it via python wait_test.py
to use the non-timeout wait(None). Run it via python wait_test.py <timeout value>
to provide a specific timeout to the Main wait.
Once the program is running, execute kill -15 <pid>
and see how the app reacts.
#
# Save this to a file called wait_test.py
#
import signal
import subprocess
import sys
from datetime import datetime
slave_proc = None
def sigterm_handler(signum, stack):
print("Process received SIGTERM signal {} while processing job!".format(signum))
print("slave_proc is {}".format(slave_proc))
if slave_proc is not None:
try:
print("{}: Sending SIGINT to slave.".format(datetime.now()))
slave_proc.send_signal(signal.SIGINT)
slave_proc.wait(30)
print("{}: Handler wait completed.".format(datetime.now()))
except subprocess.TimeoutExpired:
slave_proc.terminate()
except Exception as exception:
print('Sigterm Exception: {}'.format(exception))
slave_proc.terminate()
slave_proc.send_signal(signal.SIGKILL)
def main(wait_val=None):
with open("stdout.txt", 'w+') as stdout:
with open("stderr.txt", 'w+') as stderr:
proc = subprocess.Popen(["python", "wait_test.py", "slave"],
stdout=stdout,
stderr=stderr,
universal_newlines=True)
print('Slave Started')
global slave_proc
slave_proc = proc
try:
proc.wait(wait_val) # If this is a no-timeout wait, ie: wait(None), then will hang in sigterm_handler.
print('Slave Finished by itself.')
except subprocess.TimeoutExpired as te:
print(te)
print('Slave finished by timeout')
proc.send_signal(signal.SIGINT)
proc.wait()
print("Job completed")
if __name__ == '__main__':
if len(sys.argv) > 1 and sys.argv[1] == 'slave':
while True:
pass
signal.signal(signal.SIGTERM, sigterm_handler)
main(int(sys.argv[1]) if len(sys.argv) > 1 else None)
print("{}: Exiting main.".format(datetime.now()))
Here is an example of the two runs:
Note here the 30 second delay
--------------------------------
[mkurtz@localhost testing]$ python wait_test.py
Slave Started
Process received SIGTERM signal 15 while processing job!
slave_proc is <subprocess.Popen object at 0x7f79b50e8d90>
2022-03-30 11:08:15.526319: Sending SIGINT to slave. <--- 11:08:15
Slave Finished by itself.
Job completed
2022-03-30 11:08:45.526942: Exiting main. <--- 11:08:45
Note here the instantaneous shutdown
-------------------------------------
[mkurtz@localhost testing]$ python wait_test.py 100
Slave Started
Process received SIGTERM signal 15 while processing job!
slave_proc is <subprocess.Popen object at 0x7fa2412a2dd0>
2022-03-30 11:10:03.649931: Sending SIGINT to slave. <--- 11:10:03.649
2022-03-30 11:10:03.653170: Handler wait completed. <--- 11:10:03.653
Slave Finished by itself.
Job completed
2022-03-30 11:10:03.673234: Exiting main. <--- 11:10:03.673
These specific tests were run using Python 3.7.9 on CentOS 7. Can someone explain this behavior?
The Popen
class has an internal lock for wait
operations:
# Held while anything is calling waitpid before returncode has been
# updated to prevent clobbering returncode if wait() or poll() are
# called from multiple threads at once. After acquiring the lock,
# code must re-check self.returncode to see if another thread just
# finished a waitpid() call.
self._waitpid_lock = threading.Lock()
The major difference between wait()
and wait(timeout=...)
is that the former waits indefinitely while holding the lock, whereas the latter is a busy loop that releases the lock on each iteration.
if timeout is not None:
...
while True:
if self._waitpid_lock.acquire(False):
try:
...
# wait without any delay
(pid, sts) = self._try_wait(os.WNOHANG)
...
finally:
self._waitpid_lock.release()
...
time.sleep(delay)
else:
while self.returncode is None:
with self._waitpid_lock: # acquire lock unconditionally
...
# wait indefinitley
(pid, sts) = self._try_wait(0)
This isn't a problem for regular concurrent code – i.e. threading
– since the thread running wait()
and holding the lock will be woken up as soon as the subprocess finishes. This in turn allows all other threads waiting on the lock/subprocess to proceed promptly.
However, things look different when a) the main thread holds the lock in wait()
and b) a signal handler attempts to wait. A subtle point of signal handlers is that they interrupt the main thread:
signal
: Signals and ThreadsPython signal handlers are always executed in the main Python thread of the main interpreter, even if the signal was received in another thread. […]
Since the signal handler runs in the main thread, the main thread's regular code execution is paused until the signal handler finishes!
By running wait
in the signal handler, a) the signal handler blocks waiting for the lock and b) the lock blocks waiting for the signal handler. Only once the signal handler wait
times out does the "main thread" resume, receive confirmation that the suprocess finished, set the return code and release the lock.