
MATLAB memory allocation when max size is unknown


I am trying to speed up a MATLAB script that dynamically grows a matrix: it reads a line of data from a file and writes it into the matrix, then reads another line and reallocates a larger matrix to store it. The reason I did this instead of preallocating with zeros() or something similar is that I don't know the exact size the matrix needs to be to hold all of the data. I also don't know the maximum size of the matrix, so I can't just preallocate a maximum size and then discard the memory I didn't use. This was fine for small amounts of data, but now I need to scale the script up to read many millions of data points, and this incremental reallocation is far too slow.

So here is my attempt to speed up the script: I tried to allocate memory in large blocks using the zeros function, and then, once a block is filled, allocate another large block. Here is some sample code:

data = [];
count = 0;

for ii = 1:num_filelines
    if mod(count, 1000) == 0
        data = [data; zeros(1000)];  % after 1000 lines are read, allocate another 1000 lines
    end
    data(ii, :) = line_read(file);   % line_read reads a line of data from 'file'
end

Unfortunately, this doesn't work; when I run it I get the error "Error using vertcat: Dimensions of matrices being concatenated are not consistent."

So here are my questions: is this method of allocating memory in large blocks actually any faster than incremental reallocation, and why does the above code not run? Thanks for the help.


Solution

  • If you know the number of lines and can guess a large enough, acceptable number of columns, I recommend using a sparse matrix.

    % create a sparse matrix
    mat = sparse(numRows, numCols);
    

    A sparse matrix does not store the zero elements; it stores only the nonzero values and their indices. This can save a lot of space. Sparse matrices are used and accessed the same way as any other matrix. Use this approach only if you really need the data in matrix form from the beginning.
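
    For concreteness, here is a minimal sketch of filling a sparse matrix line by line; line_read is the hypothetical reader from the question, and maxCols is an assumed guessed upper bound on the line width:

    % sketch only: numRows and maxCols are assumed known/guessed
    mat = sparse(numRows, maxCols);
    for ii = 1:numRows
        row = line_read(file);         % hypothetical line reader from the question
        mat(ii, 1:numel(row)) = row;   % only nonzero values are actually stored
    end

    Note that inserting into a sparse matrix element by element can itself be slow, since the internal storage is reallocated as nonzeros are added, so this mainly saves memory rather than time.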

    If you do not need a matrix from the start, you can just do everything with a cell array. Preallocate a cell array with as many elements as there are lines in your file.

    data = cell(numLines, 1);       % column cell so the rows stack vertically
    for i = 1:numLines
        % get the matrix for this line
        data{i} = line_read(file);
    end
    data = cell2mat(data);          % concatenate the rows into one matrix
    

    This method stores everything in a preallocated cell array, which handles the "dynamic" sizing, and then converts it to a regular matrix.
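
    One caveat: cell2mat requires the contents to tile consistently, so every line must produce the same number of columns; if line lengths can vary, pad the shorter rows before converting.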

    Addition

    If you use the sparse matrix method, your matrix will likely end up larger than necessary; once you are done, you can easily trim the excess columns and then cast it to a regular matrix.

    val = max(sum(mat ~= 0, 2));       % widest row: largest count of nonzeros in any row
    mat(:, val+1:size(mat,2)) = [];    % delete the columns beyond that width
    mat = full(mat);                   % use this only if you really need the full matrix
    

    This removes the unnecessary columns and then casts the result to a full matrix that includes the zero elements. I would not recommend casting to a full matrix, as it requires far more memory, but if you truly need it, use it.

    UPDATE

    To get the number of lines in a file easily, use MATLAB's perl function to run a tiny Perl script.

    Create a file called countlines.pl and paste in the two lines below:

    while (<>) {};
    print $.,"\n";
    

    Then you can run this script on your file as follows:

    numLines = str2double(perl('countlines.pl','data.csv'));
    

    Problem solved.

    From a MATLAB forum thread.
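
    If you would rather not depend on Perl, a plain-MATLAB sketch that counts lines with fgetl works too (assuming an ordinary text file; 'data.csv' is just the example name from above):

    fid = fopen('data.csv', 'r');
    numLines = 0;
    while ischar(fgetl(fid))    % fgetl returns -1 (not char) at end of file
        numLines = numLines + 1;
    end
    fclose(fid);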

    Remember, it is always best to preallocate everything beforehand, because with Shai's method you are repeatedly reallocating large amounts of memory, which is especially costly for a large file.
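
    As for why the snippet in the question fails: zeros(1000) creates a square 1000-by-1000 block rather than 1000 rows of your data's width, so as soon as data has a different number of columns the vertical concatenation errors out (and count is never incremented, so a block would be appended on every iteration). A minimal corrected sketch of the block-growing idea, assuming each line holds num_cols values (num_cols, num_filelines, and line_read are hypothetical names taken from the question):

    block = 1000;                        % grow in chunks of 1000 rows
    data  = zeros(block, num_cols);      % num_cols must be known or guessed
    for ii = 1:num_filelines
        if ii > size(data, 1)            % current block is full: append another
            data = [data; zeros(block, num_cols)]; %#ok<AGROW>
        end
        data(ii, :) = line_read(file);
    end
    data = data(1:num_filelines, :);     % trim the unused preallocated rows

    Growing in blocks amortizes the reallocation cost, so it is much faster than growing one row at a time, though counting the lines first and preallocating once is still best.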