Tags: python, docker, multiprocessing, vscode-debugger

Multiprocessing Pool Loses Variable Values on Some Runs


I have code similar to this:

import multiprocessing

const = "ABC"
val_list = [1, 2, 3]

def functionWrapper(arg):
    # real implementation omitted; it uses arg["val"] and arg["const"]
    ...

if __name__ == "__main__":
    args = [{"val": val, "const": const} for val in val_list]

    with multiprocessing.Pool() as p:
        p.map(functionWrapper, args)

When I run with a long val_list, e.g., 2,000 entries, then on some of the multiprocessing runs the val argument turns into the string "val" instead of a number, or the const argument turns into "const" instead of "ABC". Why does this happen, and how do I stop it? The extra-strange thing is that the issue with val only happens when I'm running the code in Docker, and the issue with const only happens when I'm running in the VSCode debugger, so it seems like a memory allocation issue with multiprocessing. When I process val_list in sequence without multiprocessing, I get no errors.


Solution

  • You explain that given args like

        [{'val': 1, 'const': 'ABC'},
         {'val': 2, 'const': 'ABC'},
         {'val': 3, 'const': 'ABC'}]

    occasionally functionWrapper unexpectedly receives the literal string 'const' or 'val' as its input argument.

    Well, that's Bad.

    We want to know where those literals came from. They look suspiciously like the dict keys themselves. Doing the experiment of renaming the key to 'konst' would presumably make that new string show up in the buggy output, confirming the keys are what leak through.
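
    A minimal sketch of that experiment (the stand-in functionWrapper and its assertions here are illustrative, not your real code): rename the key and have the wrapper fail loudly the moment a literal key name arrives.

        import multiprocessing

        def functionWrapper(arg):
            # Stand-in for the real wrapper: fail loudly if a literal key
            # name (or any other unexpected value) leaks into the argument.
            assert isinstance(arg, dict), f"non-dict argument: {arg!r}"
            assert arg["konst"] == "ABC", f"unexpected konst: {arg!r}"
            assert isinstance(arg["val"], int), f"unexpected val: {arg!r}"

        if __name__ == "__main__":
            args = [{"val": val, "konst": "ABC"} for val in range(2000)]
            with multiprocessing.Pool() as p:
                p.map(functionWrapper, args)

    If the failures now report 'konst' instead of 'const', the leaked strings really are the serialized key names.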

    Let's think about the pool's actions. It serializes the args list and sends chunks of it to each child for deserialization and processing. There is something about your production setup that is more complicated than the nice simple code posted here. I'm not sure your vals are all numeric; imagine they are of mixed type, or even that they're all strings.

    If the serializer / deserializer does not properly round-trip list elements containing unusual unicode strings or other datatypes, then there is the possibility for it to get out of sync. We sometimes see a similar effect when round-tripping a dataframe through the CSV file format, where escaping of newlines and delimiters isn't completely effective. It sounds a little far-fetched, but it's a working theory.
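
    One cheap way to test that theory is to round-trip every element through pickle (which is what the pool uses) before the pool ever sees it. The args list below is a hypothetical stand-in; substitute your real one.

        import pickle

        # Hypothetical stand-in for the real args list; substitute your own.
        args = [{"val": val, "const": "ABC"} for val in range(2000)]

        # Check that every element survives a serialize / deserialize round trip.
        for i, arg in enumerate(args):
            restored = pickle.loads(pickle.dumps(arg))
            if restored != arg:
                print(f"round-trip mismatch at index {i}: {arg!r} -> {restored!r}")

    If that loop is silent, plain pickling is not the culprit and the corruption is happening somewhere in transport.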

    Also, data values larger than the pipe's 4 KiB atomic-write limit (PIPE_BUF) should still work fine, but such writes are no longer guaranteed to be atomic, which might expose unusual sync timing behavior.

    Going even farther out on a limb, perhaps something is racing and managing to swallow or discard characters from the pipe that the child is reading. Verify that your production code doesn't have anything attempting to read from or otherwise interfere with those pipes. Boil it down to simpler and simpler test cases, still more complex than what you posted, until the symptom no longer manifests.

    This is all speculative, but it points to avenues that can be explored to confirm / reject possible failure modes.

    You appear to be using the pool in the correct way -- it should be fine to send such data values down to worker processes. That is the normal way to use the multiprocessing module.

    Performing initialization too early, before workers are spawned, is one typical gotcha with this module. You might open() a file and then fork(), leaving the same file open in both parent and child. It is often helpful to lazily defer such actions until after the child starts, so each child has its own private connection to the file.
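
    A common way to get that deferral, sketched here with a hypothetical per-worker log file, is a Pool initializer that runs once inside each child:

        import multiprocessing
        import os

        _log = None  # per-process handle, assigned in the initializer

        def init_worker():
            # Runs once in each child after it starts, so every worker gets
            # its own private file handle rather than one inherited across
            # fork() from the parent.
            global _log
            _log = open(f"worker-{os.getpid()}.log", "w")

        def functionWrapper(arg):
            _log.write(f"processing {arg!r}\n")

        if __name__ == "__main__":
            args = [{"val": v, "const": "ABC"} for v in [1, 2, 3]]
            with multiprocessing.Pool(initializer=init_worker) as p:
                p.map(functionWrapper, args)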


    action item:

    Rather than serializing the bulk data of interest, persist it first and pass just row IDs or filenames down to the children, which use those to retrieve the bulk data on their own. Doing that experiment will help you better understand the failure mode and the per-environment details that trigger it.
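
    A sketch of that experiment, assuming the data can be dumped to JSON files first (the file names here are made up for illustration):

        import json
        import multiprocessing

        def functionWrapper(path):
            # Each child fetches its own input from disk, so only a short
            # filename travels through the pool's pipe.
            with open(path) as f:
                arg = json.load(f)
            # ... do the real work with arg["val"] and arg["const"] ...

        if __name__ == "__main__":
            const = "ABC"
            val_list = [1, 2, 3]
            paths = []
            for i, val in enumerate(val_list):
                path = f"arg-{i}.json"
                with open(path, "w") as f:
                    json.dump({"val": val, "const": const}, f)
                paths.append(path)
            with multiprocessing.Pool() as p:
                p.map(functionWrapper, paths)

    If the symptom disappears with that change, that is strong evidence the corruption lives in the serialization / pipe path rather than in your worker logic.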

    Good luck!