Tags: matlab, performance, binary-files

Loading chunks of large binary files in Matlab, quickly


I have some pretty massive data files (256 channels, on the order of 75-100 million samples = ~40-50 GB or so per file) in int16 format. Each file is written in flat binary format with samples interleaved across channels, so the structure is: CH1S1, CH2S1, CH3S1, ..., CH256S1, CH1S2, CH2S2, ...
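
In other words, sample s of channel c starts at byte ((s-1)*nChan + (c-1))*2 from the beginning of the file. A quick sketch of that arithmetic, with made-up example values (the variable names here are just for illustration):

nChan        = 256;   % channels per sample frame
bytesPerSamp = 2;     % int16 = 2 bytes
c = 3; s = 5;         % example: channel 3, sample 5 (1-based)
byteOffset = ((s-1)*nChan + (c-1)) * bytesPerSamp;   % = 2052 bytes from 'bof'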

I need to read in each channel separately, filter and offset correct it, then save. My current bottleneck is loading each channel, which takes about 7-8 minutes... scale that up 256 times, and I'm looking at nearly 30 hours just to load the data! I am trying to intelligently use fread, to skip bytes as I read each channel; I have the following code in a loop over all 256 channels to do this:

offset = i - 1;                                           % zero-based channel index
fseek(fid,offset*2,'bof');                                % jump to channel i's first sample (int16 = 2 bytes)
dat = fread(fid,[1,nSampsTotal],'*int16',(nChan-1)*2);    % read 1 sample, then skip the other 255 channels

Reading around, this is typically the fastest way to load parts of a large binary file, but is the file simply too large to do this any faster?

I'm not loading that much data: the test file I'm working with is 37 GB, but for one of the 256 channels I'm only loading 149 MB for the entire trace. Maybe the 'skip' functionality of fread is suboptimal?

System details: MATLAB 2017a, Windows 7, 64-bit, 32 GB RAM


Solution

  • @CrisLuengo's idea was much faster: essentially, chunk the data, load each chunk, and then split it out to separate per-channel files to save RAM (a sketch of that splitting step is included after the timing comparison below).

    Here is some code for just the loading part, which is fast (less than 1 minute):

    % fake raw data (rand gives doubles in [0,1], which round to 0/1 when
    % written as int16; that's fine for timing purposes)
    disp('building... ');
    nChan = 256;
    nSampsTotal = 10e6;
    tic; DATA = rand(nChan,nSampsTotal); toc;
    fid = fopen('rawData.dat','w');
    disp('writing flat binary file... ');
    tic; fwrite(fid,DATA(:),'int16'); toc;   % column order = interleaved by channel
    fclose(fid);
    
    % chunking parameters: samples per chunk and resulting number of chunks
    chunkSize = 1e6;
    nChunksTotal = ceil(nSampsTotal/chunkSize);
    
    
    %% load by chunks
    t1 = tic;
    fid = fopen('rawData.dat','r');
    dat = zeros(nChan,chunkSize,'int16');
    chunkCnt = 1;
    while 1
        tic
        if chunkCnt <= nChunksTotal
            % load the data
            fprintf('Chunk %02d/%02d: loading... ',chunkCnt,nChunksTotal);
            dat = fread(fid,[nChan,chunkSize],'*int16');
        else
            break;
        end
        toc;
        chunkCnt = chunkCnt + 1;
    end
    t = toc(t1); fprintf('Total time: %4.2f secs.\n\n\n',t);
    % Total time: 55.07 secs.
    fclose(fid);
    

    On the other hand, loading channel by channel, skipping through the file, takes about 20x longer (nearly 19 minutes):

    %% load by channels (slow)
    t1 = tic;
    fid = fopen('rawData.dat','r');
    dat = zeros(1,nSampsTotal);
    for i = 1:nChan
        tic;
        fprintf('Channel %03d/%03d: loading... ',i,nChan);
        offset = i-1;                                            % zero-based channel index
        fseek(fid,offset*2,'bof');                               % jump to channel i's first sample
        dat = fread(fid,[1,nSampsTotal],'*int16',(nChan-1)*2);   % read 1 sample, skip the other 255 channels
        toc;
    end
    t = toc(t1); fprintf('Total time: %4.2f secs.\n\n\n',t);
    % Total time: 1133.48 secs.
    fclose(fid);
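
    For completeness: the chunking idea also involves splitting each loaded chunk out to separate per-channel files, which the loading code above doesn't show. Here is a minimal sketch of that step, assuming one output file per channel; the 'chan%03d.dat' names and the loop structure are my own illustration, not part of the original answer:

    %% split chunks into per-channel files (illustrative sketch)
    nChan     = 256;
    chunkSize = 1e6;
    fidIn  = fopen('rawData.dat','r');
    fidOut = zeros(nChan,1);
    for ch = 1:nChan
        fidOut(ch) = fopen(sprintf('chan%03d.dat',ch),'w');   % one output file per channel
    end
    while true
        dat = fread(fidIn,[nChan,chunkSize],'*int16');         % next chunk: nChan x (up to chunkSize)
        if isempty(dat), break; end
        for ch = 1:nChan
            fwrite(fidOut(ch),dat(ch,:),'int16');              % append this chunk's samples to channel ch's file
        end
    end
    fclose(fidIn);
    for ch = 1:nChan
        fclose(fidOut(ch));
    end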
    

    I'd also like to thank OCDER on the Matlab forums for their help: link