I'm trying to use parpool in MATLAB.
I'm using the following code to start the pool and run the script of choice, but after a little while I get a notification saying that the pool will shut down.
The code below simply checks which computer it is running on and allocates the number of workers accordingly.
% Pick the pool size based on which machine the code is running on,
% then run the simulation script on every worker in the pool.
if strcmp(getenv('COMPUTERNAME'), 'EEN-PC144')
    parpool(4)                 % 4 workers on this machine
    pctRunOnAll sim_img144
elseif strcmp(getenv('COMPUTERNAME'), 'EEN-PC78')
    parpool(16)                % 16 workers on this machine
    pctRunOnAll sim_img
elseif strcmp(getenv('COMPUTERNAME'), 'EEN-PC244')
    %parpool(1)
    %pctRunOnAll sim_img
end
Parallel pool shut down notice screenshot:
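(I assume the notice above is the pool's idle-timeout warning. For reference, the timeout, in minutes, can be raised when the pool is created; the value here is just an example:)

parpool(16, 'IdleTimeout', 120)   % keep the pool alive through up to 2 hours of idle time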
I'm trying to get multiple computers, and their multiple cores, to run a set of simulations. The simulations are done line by line, and no line depends on the lines before it. I could post the simulation script itself, but it's 300 lines long and I've broken some of it out into separate scripts, which would mean even more lines of code.
The simulation package (FieldII) that I'm using doesn't like the simulation being run inside a parfor, which is why I'm using the pctRunOnAll command. I'm led to believe this approach should work by others who have supposedly got it working.
Are there reasons why the additional workers sit idle instead of working? In the process list I can see that only one worker is doing any work, even though all 16 workers are initialised.
sim_img and sim_img144 are the exact same script. I copied sim_img and renamed it because MATLAB autosaves and autoloads the latest version of a script, so if I make an experimental change to a script on one computer, it is automatically saved and loaded on the other. As an insurance policy to make sure I don't lose work, I keep two identical scripts, one running on each computer.
The sim_img(144) script loads FieldII. I then have a huge for loop that encapsulates the rest of the code. The for loop selects which simulation to do and the directory to save the result in. There are a few if and for statements that apply initialisation data to the simulator depending on what I ask it for. Once everything is set up, the script checks whether a line of data has been (or is being) processed by seeing if a result file has already been written for that line. If not, it claims the line by pre-allocating a file for it. It then does some final initialisation relating to the current line and starts the simulation of that line. After the line has been simulated, it writes the data to the pre-allocated file and then goes back to check whether another line still needs to be simulated.
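Roughly, the claiming pattern looks like the sketch below (numLines, resultsDir and simulate_line are just placeholders to illustrate the idea, not the actual sim_img code):

numLines   = 128;                        % example number of lines
resultsDir = 'results';                  % example results directory
for lineNo = 1:numLines
    resultFile = fullfile(resultsDir, sprintf('line_%03d.mat', lineNo));
    if exist(resultFile, 'file')
        continue                         % line already claimed/finished by some worker
    end
    fid = fopen(resultFile, 'w');        % pre-allocate the file to claim this line
    fclose(fid);
    rf = simulate_line(lineNo);          % FieldII simulation of this line (placeholder)
    save(resultFile, 'rf');              % overwrite the placeholder with the real data
end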
As far as I'm aware, the structure of my code shouldn't matter, because I believe a copy of the entire script is loaded onto each worker and each worker runs the whole thing. What allows it to be parallelised is the check for whether a result file for the current line already exists; if not, that file is pre-allocated. I currently have two computers working in parallel on the same simulation this way, which is essentially what I'm trying to get parpool to do.
So the question I'm asking is: are there reasons why the additional workers sit idle?
I've attempted to add as much relevant information as possible.
I ran various different types of computation on the parallel workers and concluded that my issue was down to read and write delays on the HDD. The workers were fighting each other over which of them was doing which iteration, and because of those delays they concluded that all iterations were already complete. The HDD is a network drive and sometimes behaves a little strangely. It shouldn't, but it does.
There are various bodges I could use to get this working. The method I chose was to set a start-up delay for each worker, scaled by which worker it is. To make sure there was no clash on the first batch of simulations, I went with a delay of 2 seconds between each worker.
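The stagger looks roughly like the sketch below. I believe getCurrentTask returns empty on the client and a task with a worker-specific ID on each pool worker; the 2-second spacing is the value I settled on, and the exact way the delay sits inside my own script may differ:

t = getCurrentTask();
if ~isempty(t)
    pause(2 * t.ID);   % each worker waits in proportion to its ID before starting work
end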
I've now run several full simulations and it has worked perfectly. I've been using 32 workers, and waiting up to about a minute before all workers are operating is more than acceptable, since the parallel processing has reduced the simulation time by approximately 90-95%. It's not perfect, nor the most efficient approach, but it works.