I am using multiprocessing to do a bunch of parallel computing and returning the results as a list. The main issue right now is that the function often runs out of memory. My current code looks like this:
    import multiprocessing
    import pickle
    from functools import partial

    def some_func(a):
        b = a**2  # some calculations here
        return b

    with multiprocessing.Pool() as p:
        results_list = p.map(partial(some_func), a_list_of_a)

    with open("results/final_data.pkl", "wb") as f:
        pickle.dump(results_list, f)
I think one solution might be to split a_list_of_a into several batches and write the results to disk batch by batch to reduce memory usage. Is that possible?
Edited:

The main issue with my some_func is that it returns a large object: a networkx graph, specifically. The input a_list_of_a is a list of node numbers, and some_func returns graphs generated from the given nodes. The length of a_list_of_a is around 15,000. I have 64GB of memory, but the process is still killed for running out of memory.

Maybe there is a better way to store a networkx Graph? I do need to keep the nodes' attributes when storing the Graph object.
You want to get a much better feel for how memory is being consumed, and being released. The example some_func clearly allocates far less RAM than your "real" function does.
Consider using a more portable data format than pickle: for example CSV or JSONL, an RDBMS, or a binary format like Parquet or HDF5. The trouble with pickle is that if you rev library versions, it's easy to wind up with unreadable binary files. And even when a file is readable, you have to read all of it, including large objects you might not need at the moment.
    results_list = p.map(partial(some_func), a_list_of_a)

Here, you're sending a "large" result object over a pipe to the parent, which allocates storage for it and then sends it to the filesystem. Much better to have each child do the FS I/O work and return a "done!" indicator. Name the files according to the child's PID, or perhaps according to an input data hash.
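A sketch of that pattern, using an input-data hash for the filename (the names write_result, run, and OUT_DIR are mine, and some_func is a stand-in for your real graph-building work):

```python
import hashlib
import multiprocessing
import os
import pickle
import tempfile

# stand-in for your results/ directory
OUT_DIR = os.path.join(tempfile.gettempdir(), "per_child_results")
os.makedirs(OUT_DIR, exist_ok=True)

def some_func(a):
    return a ** 2  # placeholder for the real, memory-hungry computation

def write_result(a):
    result = some_func(a)
    # name the file by a hash of the input, so each task's output is findable
    name = hashlib.sha1(repr(a).encode()).hexdigest()[:12]
    path = os.path.join(OUT_DIR, name + ".pkl")
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return path  # a tiny "done!" token crosses the pipe, not the big object

def run(inputs):
    with multiprocessing.Pool() as p:
        return p.map(write_result, inputs)
```

Now the parent only ever holds a list of short path strings; each large graph lives in exactly one child's memory, briefly, before going to disk.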
You didn't really describe what a is, nor whether some_func takes a microsecond or a minute.
There's overhead incurred at each pipeline stage, so consider batching the input tasks into some convenient size, perhaps one that corresponds to between 1 and 10 seconds of processing time per batch.
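The simplest way to get that batching is the chunksize argument that Pool.map already accepts (no manual splitting needed); the run wrapper below is just for illustration:

```python
import multiprocessing

def some_func(a):
    return a ** 2  # stand-in for a task taking a few milliseconds

def run(inputs, chunksize=64):
    with multiprocessing.Pool() as p:
        # chunksize hands each worker a batch of tasks per pipe round-trip,
        # amortizing the per-task serialization and scheduling overhead
        return p.map(some_func, inputs, chunksize=chunksize)
```

Tune chunksize so one chunk lands in that 1-10 second sweet spot for your real workload.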
Consider using .imap_unordered() instead. If you don't care about the order, and don't care when something happens as long as it happens, then give the multiprocessing module some scheduling flexibility; do not unduly constrain it.
It's possible you'll need a post-processing phase to assemble results to put them in some required order, and that's fine. It may well take fewer elapsed minutes to complete all that.
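A sketch of that shape, with the post-processing phase included (tagging each result with its input is my convention here, so order can be restored afterwards):

```python
import multiprocessing

def some_func(a):
    # tag each result with its input; imap_unordered yields in completion order
    return (a, a ** 2)

def run(inputs):
    with multiprocessing.Pool() as p:
        results = list(p.imap_unordered(some_func, inputs, chunksize=16))
    # post-processing phase: reassemble into input order only if you need it
    results.sort(key=lambda pair: pair[0])
    return [value for _, value in results]
```

Fast-finishing tasks stream back immediately instead of waiting behind slow neighbors, which is where the elapsed-minutes savings come from.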