How to ensure a subprocess is killed on timeout when using `run`?

I am using the following code to launch a subprocess:

# Run the program
subprocess_result = subprocess.run(
    cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    check=False,
    timeout=timeout,
    cwd=directory,
    env=env,
    preexec_fn=set_memory_limits,
)

The launched subprocess is also a Python program, with a shebang. It may run for longer than the specified timeout. It does heavy computations, writes its results to a file, and does not contain any signal handler.

According to the documentation (https://docs.python.org/3/library/subprocess.html#subprocess.run), subprocess.run kills a child that times out:

The timeout argument is passed to Popen.communicate(). If the timeout expires, the child process will be killed and waited for. The TimeoutExpired exception will be re-raised after the child process has terminated.

When my subprocess times out, I always receive the subprocess.TimeoutExpired exception, but from time to time the subprocess is not killed and keeps consuming resources on my machine.

So my question is: am I doing something wrong here? If yes, what? If no, why do I have this issue and how can I solve it?

Note: I am using Python 3.10 on Ubuntu 22.04.

Solution

The most likely culprit for the behaviour you see is that the subprocess you are spawning uses multiprocessing (or something similar) and spawns its own child processes. Killing the parent process does not automatically kill the whole set of descendants. The grandchildren are inherited by the init process (i.e. the process with PID 1) and will continue to run.

You can verify this from the source code of subprocess.run:

with Popen(*popenargs, **kwargs) as process:
    try:
        stdout, stderr = process.communicate(input, timeout=timeout)
    except TimeoutExpired as exc:
        process.kill()
        if _mswindows:
            # Windows accumulates the output in a single blocking
            # read() call run on child threads, with the timeout
            # being done in a join() on those threads.  communicate()
            # _after_ kill() is required to collect that and add it
            # to the exception.
            exc.stdout, exc.stderr = process.communicate()
        else:
            # POSIX _communicate already populated the output so
            # far into the TimeoutExpired exception.
            process.wait()
        raise
    except:  # Including KeyboardInterrupt, communicate handled that.
        process.kill()
        # We don't call process.wait() as .__exit__ does that for us.
        raise

Here you can see that the timeout is set on the communicate call, and if it fires, the except TimeoutExpired branch calls process.kill() on the subprocess. On POSIX, the kill method sends SIGKILL, which immediately kills the subprocess without any cleanup. It's a signal that cannot be caught by the subprocess, so it's not possible that the child is somehow ignoring it.

The TimeoutExpired exception is then re-raised by the bare raise after process.wait(), so if your parent process sees this exception the subprocess is already dead.

This however says nothing about grandchild processes. Those will continue to run as children of PID 1.
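To check whether this is what happens on your machine, the behaviour can be reproduced with a small script. The following is only a sketch for POSIX systems: CHILD_SOURCE is a hypothetical stand-in for your real program, and grandchild_survives_timeout is a name made up for this illustration.

```python
import os
import signal
import subprocess
import sys

# Hypothetical stand-in for the real program: it starts one grandchild,
# prints the grandchild's PID, then keeps running past the timeout.
CHILD_SOURCE = """
import subprocess, sys, time
grandchild = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
print(grandchild.pid, flush=True)
time.sleep(30)
"""


def grandchild_survives_timeout(timeout=3.0):
    """Return True if the grandchild is still alive after subprocess.run times out."""
    try:
        subprocess.run(
            [sys.executable, "-c", CHILD_SOURCE],
            stdout=subprocess.PIPE,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        # On POSIX, the output read before the timeout is attached to the exception.
        grandchild_pid = int(exc.stdout.split()[0])
    else:
        raise RuntimeError("the child finished before the timeout")

    try:
        os.kill(grandchild_pid, 0)  # signal 0 only checks that the PID exists
    except ProcessLookupError:
        return False
    os.kill(grandchild_pid, signal.SIGKILL)  # clean up the orphan we created
    return True
```

If grandchild_survives_timeout() returns True, the direct child was killed by subprocess.run but its own child kept running, which matches the symptom described in the question.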

I don't see any way in which you can customize how subprocess.run handles subprocess termination. For example, if it used SIGTERM instead of SIGKILL you could modify your child process or write a wrapper process that will catch the signal and properly kill all its descendants. But SIGKILL doesn't give you this luxury.

So I believe that for your use case you cannot use the subprocess.run facade and should use Popen directly instead. You can look at the subprocess.run implementation and take just the things that you need, maybe dropping support for platforms you don't use.
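One possible way to do this on POSIX, sketched here under assumptions rather than offered as the definitive fix: start the child in its own session with start_new_session=True, so that it leads a new process group, and on timeout send SIGKILL to the whole group with os.killpg. The names run_killing_group and _kill_group are invented for this example, and the function always captures stdout and stderr as in your original call.

```python
import os
import signal
import subprocess


def _kill_group(process):
    """SIGKILL every process in the child's process group, if it still exists."""
    try:
        # With start_new_session=True the child called setsid(), so its
        # process group ID is equal to its PID.
        os.killpg(process.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group is already gone


def run_killing_group(cmd, timeout, **popen_kwargs):
    """Like a reduced subprocess.run, but kills the whole group on timeout (POSIX only)."""
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,  # new session, hence a new process group
        **popen_kwargs,
    ) as process:
        try:
            stdout, stderr = process.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            _kill_group(process)
            process.wait()  # reap the direct child before re-raising
            raise
        except BaseException:  # including KeyboardInterrupt, as in subprocess.run
            _kill_group(process)
            raise
    return subprocess.CompletedProcess(
        process.args, process.returncode, stdout, stderr
    )
```

The remaining arguments from the question (cwd=directory, env=env, preexec_fn=set_memory_limits) can be passed as keyword arguments. Note that a descendant which moves itself into another session or process group (for example by calling os.setsid()) would escape this kill, so this only covers descendants that stay in the child's group.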

Note: There are extremely rare situations in which a subprocess won't die immediately on SIGKILL. I believe the only situation in which this happens is when the subprocess is performing a very long system call or other kernel operation that cannot be interrupted immediately. If that operation is deadlocked, it might prevent the process from ever terminating. However, I don't think this is your case: you did not mention that the process is stuck doing nothing, and from what you said it simply seems to continue running.