Tags: r, foreach, parallel-processing, parallel-foreach

Parallel processing data analysis - Is there a benefit to having more splits than processor cores?


I am using a naive Bayesian classifier to predict some test data in R. The test data has >1,000,000,000 records and takes far too long to process on a single processor. The computer I am using has (only) four processors in total, three of which I can free up to run my task (I could use all four, but I prefer to keep one for other work I need to do).

Using the foreach and doSNOW packages, and following this tutorial, I have things set up and running (a rough sketch of the setup is at the end of this question). My question is:

I have the dataset split into three parts, one part per processor. Is there any benefit to splitting the dataset into, say, 6, 9, or 12 parts instead? In other words, what is the trade-off between having more splits versus just having one big block of records for each processor core to run?

I haven't provided any data here because I think this question is more theoretical. But if data are needed, please let me know.
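
For reference, a minimal sketch of the kind of setup I mean, assuming a standard doSNOW cluster (the classifier call itself is omitted, since the question is theoretical):

    library(foreach)
    library(doSNOW)

    # Three workers, one per split, leaving one core free for other work
    cl <- makeCluster(3)
    registerDoSNOW(cl)

    # ... run the classifier over the splits with foreach(...) %dopar% ...

    stopCluster(cl)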


Solution

  • Broadly speaking, the advantage of splitting the dataset into more parts than cores is that you can keep all of your processors busy.

    If the dataset is split into 3 parts, one per processor, and they take the following time:

    Split A - 10 min

    Split B - 20 min

    Split C - 12 min

    You can see immediately that two of your processors will sit idle for a significant portion of the total time needed to complete the analysis.

    Instead, if you have 12 splits, each taking between 3 and 6 minutes to run, then a processor can pick up another chunk of the job as soon as it finishes its current one, instead of idling until the longest-running split finishes.
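
    As a rough sketch of that pattern, assuming an e1071-style naive Bayes model object called fit and a test data frame called test_data (both names are illustrative, as is the chunk count):

        library(foreach)
        library(doSNOW)

        cl <- makeCluster(3)            # three workers, as in the question
        registerDoSNOW(cl)

        # 12 chunks across 3 workers: a fast worker picks up a new chunk
        # as soon as it finishes, instead of waiting on the slowest split
        n_chunks <- 12
        chunks <- split(test_data,
                        cut(seq_len(nrow(test_data)), n_chunks, labels = FALSE))

        predictions <- foreach(chunk = chunks,
                               .combine = c,
                               .packages = "e1071") %dopar% {
          predict(fit, chunk)
        }

        stopCluster(cl)

    The counterweight is per-chunk overhead: each chunk adds scheduling and data-transfer cost, so hundreds of tiny chunks can end up spending more time on communication than on computation. A small multiple of the worker count, say three to five chunks per core, is usually a reasonable balance.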