Tags: matlab, parallel-processing, parfor, supercomputers

How to partition datasets into n blocks to reduce queue time on a supercomputer?


I have a dataset which includes approximately 2000 digital images. I am using MATLAB to perform some digital image processing to extract trees from the imagery. The script is currently configured to process the images in a parfor loop on n cores.

The challenge:
I have access to processing time on a university-managed supercomputer with approximately 10,000 compute cores. If I submit the entire job for processing, I am placed so far back in the tasking queue that a desktop computer could finish the job before processing even starts on the supercomputer. Support staff have told me that partitioning the 2000-file dataset into jobs of ~100 files each will significantly decrease the queue time. What method can I use to process the files in parallel with the parfor loop while submitting 100 files (of 2000) at a time?

My script is structured in the following way:

datadir = 'C:\path\to\input\files';
files = dir(fullfile(datadir, '*.tif'));
fileIndex = find(~[files.isdir]);

parfor ix = 1:length(fileIndex) 
     % Perform the processing on each file;
end

Solution

  • As in my comment, I would suggest something like the following:

    datadir = 'C:\path\to\input\files';
    files = dir(fullfile(datadir, '*.tif'));
    files = files(~[files.isdir]);
    
    % split up the data
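    N = length(files); % e.g. 2000
    jobSize = 100;
    % block sizes: full blocks of jobSize, plus the remainder if there is one
    blockSizes = [jobSize*ones(1,floor(N/jobSize)), mod(N,jobSize)];
    blockSizes(blockSizes == 0) = []; % avoid an empty last block when jobSize divides N
    jobFiles = mat2cell(files, blockSizes);
    jobNum = length(jobFiles);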
    
    % Provide each job to a worker
    parfor jobIdx = 1:jobNum
        thisJob = jobFiles{jobIdx}; % slicing jobFiles by jobIdx lets MATLAB transfer
                                    % only the relevant file data to each worker
    
        for fIdx = 1:length(thisJob)
            thisFile = thisJob(fIdx);
            % Perform the processing on each file;
            disp(thisFile.name); % placeholder: show which file is being handled
        end
    end
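
If each block of 100 files is instead submitted to the cluster as its own job, every job needs to work out which files belong to it. The following is a minimal sketch of that index arithmetic, not a tested cluster setup: `blockIdx` is a hypothetical value assumed to be handed to the script by the job submission (how that happens depends on the scheduler), `N` is hard-coded here where `numel(files)` would normally be used, and the loop body is a placeholder for the real image processing.

```matlab
% Hypothetical inputs: in practice N = numel(files), and blockIdx would be
% supplied by the job submission (assumption, scheduler-specific).
N        = 2000;   % total number of files, as in the question
jobSize  = 100;    % files handled per submitted job
blockIdx = 3;      % which block this particular job processes

numBlocks = ceil(N / jobSize);              % number of jobs needed in total
firstIdx  = (blockIdx - 1) * jobSize + 1;   % first file index of this block
lastIdx   = min(blockIdx * jobSize, N);     % last index, clipped for the final block
blockIdxs = firstIdx:lastIdx;               % file indices owned by this job

results = zeros(1, numel(blockIdxs));
parfor k = 1:numel(blockIdxs)
    ix = blockIdxs(k);
    % Perform the processing on files(ix) here; storing the index is a placeholder.
    results(k) = ix;
end
```

With these values the job covers files 201 to 300. Submitting the same script with `blockIdx` running from 1 to `numBlocks` (20 here) would cover all 2000 files, and the `min` call keeps the last block within bounds when `N` is not a multiple of `jobSize`.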