I am using a naive Bayes classifier to predict some test data in R. The test data has more than 1,000,000,000 records, which takes far too long to process on a single processor. The machine I am using has (only) four processors in total, three of which I can free up for this task (I could use all four, but prefer to keep one for other work I need to do).
Using the foreach and doSNOW packages, and following this tutorial, I have things set up and running. My question is:
I have the dataset split into three parts, one part per processor. Is there a benefit to splitting the dataset into, say, 6, 9, or 12 parts instead? In other words, what is the trade-off between more splits versus just one big block of records per processor core?
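For reference, this is roughly what my current setup looks like; `test_data` and `model` are stand-ins for my actual objects (shown here as an e1071-style naiveBayes model):

```r
library(foreach)
library(doSNOW)   # also attaches snow, which provides makeCluster()

# Placeholders: `test_data` is the big test set, `model` is a trained
# e1071 naiveBayes model (an assumption for illustration).
cl <- makeCluster(3, type = "SOCK")   # three workers, one core kept free
registerDoSNOW(cl)

# One chunk per worker: split the rows into 3 roughly equal groups
chunks <- split(test_data,
                cut(seq_len(nrow(test_data)), 3, labels = FALSE))

preds <- foreach(chunk = chunks, .combine = c,
                 .packages = "e1071") %dopar% {
  as.character(predict(model, chunk))
}

stopCluster(cl)
```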
I haven't provided any data here because I think this question is more theoretical. But if data are needed, please let me know.
Broadly speaking, the advantage of splitting the data into more parts is better processor utilization.
If the dataset is split into 3 parts, one per processor, and they take the following times:
Split A - 10 min
Split B - 20 min
Split C - 12 min
You can see immediately that two of your processors will sit idle for a significant share of the run: the full job takes 20 minutes (the longest split), even though the total work is only 42 processor-minutes, which three processors could in principle finish in 14 minutes.
Instead, if you have 12 splits, each taking between 3 and 6 minutes to run, then a processor can pick up another chunk as soon as it finishes its current one, instead of idling until the longest-running split finishes; see the sketch below.
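Here is a minimal sketch of that pattern with foreach and doSNOW (a sketch under assumptions: `test_data` and `model` stand in for your test set and trained e1071 naiveBayes model). doSNOW dispatches tasks to workers as they become free, so the only real change from a one-chunk-per-worker setup is the chunk count:

```r
library(foreach)
library(doSNOW)

cl <- makeCluster(3, type = "SOCK")
registerDoSNOW(cl)

# 12 chunks on 3 workers: as each worker finishes a chunk, doSNOW
# hands it the next one, so a fast worker is never left idling.
n_chunks <- 12
chunks <- split(test_data,
                cut(seq_len(nrow(test_data)), n_chunks, labels = FALSE))

preds <- foreach(chunk = chunks, .combine = c,
                 .packages = "e1071") %dopar% {
  as.character(predict(model, chunk))
}

stopCluster(cl)
```

The counterweight is per-chunk overhead: each chunk has to be serialized, shipped to a worker, and its result shipped back, so a very large number of tiny chunks can end up slower than a handful of big ones. A common compromise is a small multiple of the worker count (say 3 or 4 chunks per worker), which smooths out uneven chunk times without much added overhead.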