I am using a MacBook, and therefore multiprocessing will use the fork system call instead of spawning a new process. I am also using Python (with multiprocessing or Dask).
I have a very big pandas DataFrame, and I need many parallel subprocesses to each work on a portion of it. Let's say I have 100 partitions of this table that need to be worked on in parallel. I want to avoid making 100 copies of the big DataFrame, as that would overwhelm memory. The current approach I am taking is to partition it, save each partition to disk, and have each process read in the partition it is responsible for. But this read/write is very expensive for me, and I would like to avoid it.
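For concreteness, the disk-based approach I am currently using looks roughly like this (a rough sketch; the partition count, file names, and per-partition work are placeholders):

import numpy as np
import pandas as pd
from multiprocessing import Pool

NUM_PARTITIONS = 100  # placeholder

def process_partition(i):
    # Each worker reads only its own partition back from disk.
    part = pd.read_pickle(f"partition_{i}.pkl")
    return part.sum().sum()  # stand-in for the real per-partition work

if __name__ == "__main__":
    one_big_df = pd.DataFrame(np.random.rand(1_000_000, 10))
    # Write each partition to disk so the workers can read them back in.
    n = len(one_big_df)
    edges = [i * n // NUM_PARTITIONS for i in range(NUM_PARTITIONS + 1)]
    for i in range(NUM_PARTITIONS):
        one_big_df.iloc[edges[i]:edges[i + 1]].to_pickle(f"partition_{i}.pkl")
    with Pool(processes=4) as p:
        results = p.map(process_partition, range(NUM_PARTITIONS))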
But if I make this DataFrame a global variable, then due to COW behavior each process will be able to read from it without making an actual physical copy (as long as it does not modify it). Now the question I have is: if I make this one global DataFrame and name it:
global my_global_df
my_global_df = one_big_df
and then in one of the subprocesses I do:
a_portion_of_global_df_readonly = my_global_df.iloc[0:10]
a_portion_of_global_df_copied = a_portion_of_global_df_readonly.reset_index(drop=True)
# reset index will make a copy of the a_portion_of_global_df_readonly
# do something with a_portion_of_global_df_copied
If I do the above, will I have created a copy of the entire my_global_df, or just a copy of a_portion_of_global_df_readonly, and thereby, by extension, avoided making 100 copies of one_big_df?
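To make the question concrete, the behaviour I am hoping for is something like this (a sketch that compares the sizes pandas reports; exact numbers and copy semantics may vary with the pandas version):

import numpy as np
import pandas as pd

one_big_df = pd.DataFrame(np.random.rand(1_000_000, 10))

a_portion_of_global_df_readonly = one_big_df.iloc[0:10]
a_portion_of_global_df_copied = a_portion_of_global_df_readonly.reset_index(drop=True)

# The full frame holds roughly 80 MB of float data; the copied slice
# reports only the 10 rows it actually contains.
print(one_big_df.memory_usage(deep=True).sum())
print(a_portion_of_global_df_copied.memory_usage(deep=True).sum())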
One additional, more general question: why do people have to deal with pickle serialization and/or reading/writing to disk to transfer data across multiple processes when (assuming they are on UNIX) setting the data as a global variable effectively makes it available to all child processes so easily? Is there any danger in using COW as a means of making data available to subprocesses in general?
[Reproducible code from the thread below]
from multiprocessing import Pool
import contextlib
import numpy as np
import pandas as pd

def my_function(elem):
    # Report the id of the object the worker received.
    return id(elem)

num_proc = 4
num_iter = 10

df = pd.DataFrame(np.asarray([1]))
print(id(df))

with contextlib.closing(Pool(processes=num_proc)) as p:
    procs = [p.apply_async(my_function, args=(df,)) for _ in range(num_iter)]
    results = [proc.get() for proc in procs]
    p.close()
    p.join()
print(results)
Summarizing the comments: on a forking system such as Mac or Linux, a child process has a copy-on-write (COW) view of the parent's address space, including any DataFrames it may hold. It is safe to use and even modify the DataFrame in child processes without changing the data in the parent or in sibling child processes.
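A quick way to see this (a minimal sketch that assumes the fork start method, the default on Linux; on newer macOS/Python combinations you may need multiprocessing.set_start_method('fork') to get it):

import multiprocessing as mp
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

def child_modifies():
    # The write only touches the child's private (copy-on-write) pages.
    df.loc[0, "x"] = 999
    print("child sees:", df.loc[0, "x"])   # 999

if __name__ == "__main__":
    p = mp.Process(target=child_modifies)
    p.start()
    p.join()
    print("parent sees:", df.loc[0, "x"])  # still 1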
That means it is unnecessary to serialize the DataFrame to pass it to the child. All you need is a reference to the DataFrame. For a Process, you can just pass the reference directly:
# With fork, nothing here is pickled: the child inherits my_dataframe via COW.
p = multiprocessing.Process(target=worker_fctn, args=(my_dataframe,))
p.start()
p.join()
If you use a Queue or another tool such as a Pool, the data will likely be serialized (pickled) on its way to the worker. You can get around that by using a global variable that the worker knows about but that is not actually passed to it.
What remains is the return data. It is in the child only and still needs to be serialized to be returned to the parent.
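Putting those pieces together, a minimal sketch of the global-variable pattern with a Pool might look like this (the frame, partition bounds, and per-partition work are illustrative; it assumes the fork start method):

import numpy as np
import pandas as pd
from multiprocessing import Pool, set_start_method

# Built once in the parent; forked workers see it through COW, and it is
# never pickled because it is not passed as an argument.
one_big_df = pd.DataFrame(np.random.rand(1_000_000, 4))

def work_on_partition(bounds):
    # Only the small (start, stop) tuple goes through the Pool's pickling.
    start, stop = bounds
    chunk = one_big_df.iloc[start:stop].reset_index(drop=True)
    return chunk.sum().sum()  # keep the return value small: it IS pickled

if __name__ == "__main__":
    set_start_method("fork")  # already the default on Linux; explicit for newer macOS
    partitions = [(i, i + 10_000) for i in range(0, len(one_big_df), 10_000)]
    with Pool(processes=4) as p:
        results = p.map(work_on_partition, partitions)
    print(len(results))       # 100 small results, no big copies returned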