I am using a naive Bayes classifier to predict some test data in R. The test data has more than 1,000,000,000 records, which takes far too long to process on a single processor. The machine I am using has (only) four processors in total, three of which I can free up for this task (I could use all four, but prefer to keep one for other work I need to do).
Using the foreach and doSNOW packages, and following this tutorial, I have things set up and running. My question is:
I have the dataset split into three parts, one part per processor. Is there a benefit to splitting the dataset into, say, 6, 9, or 12 parts instead? In other words, what is the trade-off between more splits versus just one big block of records per processor core?
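For reference, this is roughly what my current setup looks like; `test_data` and `model` are stand-ins for my actual objects (shown here as an e1071-style naiveBayes model):

```r
library(foreach)
library(doSNOW)   # also attaches snow, which provides makeCluster()

# Placeholders: `test_data` is the big test set, `model` is a trained
# e1071 naiveBayes model (an assumption for illustration).
cl <- makeCluster(3, type = "SOCK")   # three workers, one core kept free
registerDoSNOW(cl)

# One chunk per worker: split the rows into 3 roughly equal groups
chunks <- split(test_data,
                cut(seq_len(nrow(test_data)), 3, labels = FALSE))

preds <- foreach(chunk = chunks, .combine = c,
                 .packages = "e1071") %dopar% {
  as.character(predict(model, chunk))
}

stopCluster(cl)
```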
I haven't provided any data here because I think this question is more theoretical. But if data are needed, please let me know.
Broadly speaking, the advantage of splitting the data into more parts is better processor utilization.
If the dataset is split into 3 parts, one per processor, and they take the following times:
Split A - 10 min
Split B - 20 min
Split C - 12 min
You can see immediately that two of your processors will sit idle for a significant share of the run: the full job takes 20 minutes (the longest split), even though the total work is only 42 processor-minutes, which three processors could in principle finish in 14 minutes.
Instead, if you have 12 splits, each taking between 3 and 6 minutes to run, then a processor can pick up another chunk as soon as it finishes its current one, instead of idling until the longest-running split finishes; see the sketch below.
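Here is a minimal sketch of that pattern with foreach and doSNOW (a sketch under assumptions: `test_data` and `model` stand in for your test set and trained e1071 naiveBayes model). doSNOW dispatches tasks to workers as they become free, so the only real change from a one-chunk-per-worker setup is the chunk count:

```r
library(foreach)
library(doSNOW)

cl <- makeCluster(3, type = "SOCK")
registerDoSNOW(cl)

# 12 chunks on 3 workers: as each worker finishes a chunk, doSNOW
# hands it the next one, so a fast worker is never left idling.
n_chunks <- 12
chunks <- split(test_data,
                cut(seq_len(nrow(test_data)), n_chunks, labels = FALSE))

preds <- foreach(chunk = chunks, .combine = c,
                 .packages = "e1071") %dopar% {
  as.character(predict(model, chunk))
}

stopCluster(cl)
```

The counterweight is per-chunk overhead: each chunk has to be serialized, shipped to a worker, and its result shipped back, so a very large number of tiny chunks can end up slower than a handful of big ones. A common compromise is a small multiple of the worker count (say 3 or 4 chunks per worker), which smooths out uneven chunk times without much added overhead.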