tl;dr: I have tasks with huge return values that consume a lot of memory. I'm submitting them to a concurrent.futures.ProcessPoolExecutor. The subprocesses hold onto the memory until they receive a new task. How do I force subprocesses to effectively garbage collect themselves?
    import concurrent.futures
    import time

    executor = concurrent.futures.ProcessPoolExecutor(max_workers=1)

    def big_val():
        return [{1: 1} for i in range(1, 1000000)]

    future = executor.submit(big_val)
    # do something with future result
In the above example, I'm creating a large object in a subprocess, then working with the result. From this point onward I can deal with the memory in the parent process, but the subprocess created by my ProcessPoolExecutor will hold onto the memory allocated for my task indefinitely.
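For what it's worth, the retention is easy to observe. Here's a rough diagnostic sketch, assuming psutil is installed; it peeks at executor._processes, which is an internal attribute, so treat it purely as a debugging aid (and note that RSS only shrinks if the allocator actually returns freed memory to the OS):

    import concurrent.futures
    import psutil

    def big_val():
        return [{1: 1} for i in range(1, 1000000)]

    def worker_rss_mb(executor):
        # Resident set size of each worker process, in MB.
        # executor._processes is internal API: a dict keyed by worker pid.
        return [psutil.Process(pid).memory_info().rss / 1e6
                for pid in executor._processes]

    if __name__ == "__main__":
        executor = concurrent.futures.ProcessPoolExecutor(max_workers=1)
        result = executor.submit(big_val).result()
        del result  # the parent's copy is gone...
        print(worker_rss_mb(executor))  # ...but the worker's RSS stays high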
Honestly, the only thing I can think of is submitting a dummy task:
    def donothing():
        pass

    executor.submit(donothing)
This works, but it is a) pretty clunky and, more importantly, b) untrustworthy: I have no guarantee about which subprocess a task will be sent to, so the only foolproof approach is to flood the pool with enough dummy tasks that every subprocess I care about is certain to receive one.
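The most deterministic flood I've come up with parks every dummy task on a shared barrier, so no single worker can consume two of them. A sketch, with the caveats that clear_all_workers and wait_on_barrier are my own names and that it assumes the pool really does spin up num_workers processes (the Barrier comes from multiprocessing.Manager so it can be pickled into the workers):

    import concurrent.futures
    import multiprocessing

    def wait_on_barrier(barrier):
        # Each dummy task blocks here until all of them have started,
        # so every worker is forced to pick up exactly one.
        barrier.wait(timeout=30)  # timeout guards against a hung pool

    def clear_all_workers(executor, num_workers):
        # num_workers must equal the executor's max_workers.
        with multiprocessing.Manager() as manager:
            barrier = manager.Barrier(num_workers)
            futures = [executor.submit(wait_on_barrier, barrier)
                       for _ in range(num_workers)]
            concurrent.futures.wait(futures)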
As far as I can tell, as soon as a worker process has finished running my task, it has no reason to hold onto the result. If my parent process assigned the returned Future to a local variable, then the moment the task completed, the return value would be copied into the Future in the parent, meaning the worker has no further need for it. If my parent process didn't do this, then the return value is effectively discarded anyway.
Am I misunderstanding something here, or is this just an unfortunate quirk of how subprocesses reference memory? If so, is there a better workaround?
Your dummy task approach is the only way to accomplish this without significant code refactoring (to avoid returning the huge value at all).
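For illustration, the refactoring usually amounts to persisting the big value inside the task and returning only a lightweight handle, so nothing large ever sits in the worker's result binding. A sketch; dump_big_val and the pickle-to-tempfile scheme are illustrative choices, not from your code:

    import concurrent.futures
    import os
    import pickle
    import tempfile

    def dump_big_val():
        big = [{1: 1} for i in range(1, 1000000)]
        fd, path = tempfile.mkstemp(suffix=".pkl")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(big, f)
        # big is freed when this function returns; only this short
        # string remains referenced by the worker afterwards.
        return path

    if __name__ == "__main__":
        executor = concurrent.futures.ProcessPoolExecutor(max_workers=1)
        path = executor.submit(dump_big_val).result()
        with open(path, "rb") as f:
            val = pickle.load(f)
        os.remove(path)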
The problem is that the worker process binds the result to a local name r before sending it back to the parent, and only replaces r when a new task comes in.
You could reasonably open an enhancement/bug request on the CPython bug tracker to have the worker explicitly del r after calling _sendback_result. It already does this for call_item (the packaged-up function and arguments sent to the worker) for exactly the same reason: to avoid holding onto resources beyond their window of usefulness. It makes sense to do the same thing for the already-returned, no-longer-relevant result.
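To make the proposal concrete, here's a simplified paraphrase (from memory, not the exact CPython source) of the worker loop in Lib/concurrent/futures/process.py, with the one-line fix marked:

    # Simplified paraphrase of concurrent.futures.process._process_worker;
    # the real function has more error handling and shutdown logic.
    def _process_worker(call_queue, result_queue):
        while True:
            call_item = call_queue.get(block=True)
            if call_item is None:
                return  # shutdown sentinel
            try:
                r = call_item.fn(*call_item.args, **call_item.kwargs)
            except BaseException as e:
                _sendback_result(result_queue, call_item.work_id, exception=e)
            else:
                _sendback_result(result_queue, call_item.work_id, result=r)
                del r  # <-- the proposed fix: drop the result immediately
            # CPython already does the equivalent for the call item itself:
            del call_item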