Tags: python, pandas, numpy, multiprocessing

Parallelizing independent creation of DataFrames


I have a general question about parallelizing DataFrame operations.

Let's say I have an operation like this in mind (pseudo-code; df, df1, and df2 are DataFrames):

df1 = pandas_operations(df, arg1, arg2)
df2 = pandas_operations(df, arg3, arg4)
result = pd.concat([df1, df2])

where pandas_operations is some function that makes heavy use of the pandas API to crunch numbers.

In what situations would creating df1 and df2 in parallel (e.g. with multiprocessing) make sense in order to speed up the program?

I am mainly asking this question because pandas delegates a lot of computation-heavy tasks to NumPy, which (as I understand it) already makes use of multiple cores when calling code written in C. If that is true, I could still parallelize the creation of df1 and df2 with multiprocessing, but the creation of each DataFrame would likely be slower than in the sequential program.
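
For concreteness, here is a minimal runnable sketch of the parallel variant I have in mind; pandas_operations here is a made-up stand-in, not my real function:

import pandas as pd
from multiprocessing import Pool

def pandas_operations(df, a, b):
    # Stand-in for the real number-crunching function.
    return df * a + b

if __name__ == "__main__":
    df = pd.DataFrame({"x": range(1_000_000)})

    # Each task carries its own copy of df: the arguments are
    # pickled and sent to the worker processes.
    tasks = [(df, 1, 2), (df, 3, 4)]

    with Pool(processes=2) as pool:
        df1, df2 = pool.starmap(pandas_operations, tasks)

    result = pd.concat([df1, df2])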


Solution

  • I do not want to pretend I am an expert in this domain, but I think there is a significant difference between what NumPy does, which is called SIMD (Single Instruction, Multiple Data) and allows operations to be vectorized for speed, and multiprocessing. SIMD happens at the level of the CPU architecture: a single core applies one instruction to several data elements at once. Multiprocessing, from my understanding, means having several cores, each of which can do SIMD on its own (see the sketch at the end of this answer).

    Reading this post, from a more knowledgeable person, may help: Difference between SIMD and Multi-threading. It also covers the differences between SIMD and multiprocessing.

    So, to answer the initial question: creating df1 and df2 in parallel will likely be faster, but this is a very high-level claim and would need to be quantified. I suspect it depends on the size and number of DataFrames: instantiating a new pool for a few elements that are fast to process may be counterproductive. It will also be more RAM-intensive, because the data in df has to be duplicated and passed to the two independent processes.
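
To make the distinction concrete, here is a small sketch (the array sizes and the work function are arbitrary assumptions on my part): each call into NumPy is vectorized on a single core, while the Pool spreads whole tasks across several cores, each of which can use SIMD internally.

import numpy as np
from multiprocessing import Pool

def work(seed):
    # One independent task: inside it, NumPy executes vectorized
    # (SIMD) instructions on a single core.
    rng = np.random.default_rng(seed)
    a = rng.random(1_000_000)
    return np.sqrt(a).sum()

if __name__ == "__main__":
    # Sequential: one process; SIMD speeds up each operation,
    # but the four tasks run one after another.
    sequential = [work(s) for s in range(4)]

    # Multiprocessing: four processes, one task each, at the cost
    # of process start-up and of transferring data between processes.
    with Pool(processes=4) as pool:
        parallel = pool.map(work, range(4))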