Tags: matlab, performance, file-io, dynamic-arrays, large-data

When writing a large array directly to disk in MATLAB, is there any need to preallocate?


I need to write an array that is too large to fit into memory to a .mat binary file. This can be accomplished with the matfile function, which allows random access to a .mat file on disk.
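
For concreteness, here is a minimal sketch of the kind of incremental write that matfile allows; the file name, variable name, and array sizes are assumptions chosen only for illustration:

    % Open (or create) a MAT-file for read/write access without loading it into memory.
    % matfile creates a Version 7.3 MAT-file on the first assignment if the file is new.
    m = matfile('bigdata.mat', 'Writable', true);

    nRows   = 50000;          % assumed dimensions, for illustration only
    nCols   = 1000;
    blockSz = 100;            % columns written per iteration

    for c = 1:blockSz:nCols
        cols = c:min(c + blockSz - 1, nCols);
        % Each assignment writes one block of columns directly to disk;
        % without preallocation, the variable A grows on every pass.
        m.A(1:nRows, cols) = rand(nRows, numel(cols));
    end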

Normally, the accepted advice is to preallocate arrays, because expanding them on every iteration of a loop is slow. However, when I was asking how to do this, it occurred to me that this may not be good advice when writing to disk rather than RAM.

Will the same performance hit from growing the array apply, and if so, will it be significant when compared to the time it takes to write to disk anyway?

(Assume that the whole file will be written in one session, so the risk of serious file fragmentation is low.)


Solution

  • Q: Will the same performance hit from growing the array apply, and if so will it be significant when compared to the time it takes to write to disk anyway?

    A: Yes, performance will suffer if you significantly grow a file on disk without pre-allocating. The performance hit will be a consequence of fragmentation. As you mentioned, fragmentation is less of a risk if the file is written in one session, but will cause problems if the file grows significantly.

    A related question was raised on the MathWorks website, and the accepted answer was to pre-allocate when possible.

    If you don't pre-allocate, then the extent of your performance problems will depend on:

    • your filesystem (how data are stored on disk, the cluster-size),
    • your hardware (HDD seek time, or SSD access times),
    • the size of your mat file (whether it moves into non-contiguous space),
    • and the current state of your storage (existing fragmentation / free space).

    Let's pretend that you're running a recent Windows OS, and so are using the NTFS file system. Let's further assume that it has been set up with the default 4 kB cluster size. So, space on disk is allocated in 4 kB chunks, and the locations of these clusters are recorded in the Master File Table (MFT). If the file grows and contiguous space is not available, then there are only two choices:

    1. Re-write the entire file to a new part of the disk, where there is sufficient free space.
    2. Fragment the file, storing the additional data at a different physical location on disk.

    The file system takes the least-bad option, #2, and updates the MFT record to indicate where the new clusters are located on disk.

    [Figure: illustration of a fragmented file on NTFS, from WindowsITPro]

    Now, the hard disk needs to physically move its read/write head in order to reach the new clusters, and this is a (relatively) slow process. Between moving the head and waiting for the right area of the platter to spin underneath it, you're likely looking at a seek time of around 10 ms. So every time you hit a fragment, there is an additional ~10 ms delay while the HDD moves to the new data. SSDs have much shorter access times (no moving parts). For the sake of simplicity, we're ignoring multi-platter systems and RAID arrays!
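
    To put rough numbers on that, here is a back-of-the-envelope calculation; the array size, cluster size, and fragment count are assumptions rather than measurements, but they show how quickly those 10 ms penalties add up:

        % Back-of-the-envelope estimate of fragmentation overhead (illustrative values).
        nRows = 50000;  nCols = 1000;                 % assumed array size
        bytesTotal  = nRows * nCols * 8;              % doubles are 8 bytes each -> 400 MB
        clusterSize = 4 * 1024;                       % default NTFS cluster size, 4 kB
        nClusters   = ceil(bytesTotal / clusterSize); % ~100,000 clusters to allocate

        seekTime    = 10e-3;                          % ~10 ms per seek on a typical HDD
        nFragments  = 1000;                           % assume ~1% of clusters start a new fragment
        extraDelay  = nFragments * seekTime;          % ~10 s of pure head movement

        fprintf('%.0f clusters, %d fragments -> %.1f s of extra seek time\n', ...
                nClusters, nFragments, extraDelay);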

    If you keep growing the file at different times, you may end up with a lot of fragmentation. How much depends on when, and by how much, the file grows, and on how else you are using the hard disk. The performance hit you experience will also depend on how often you read the file, and how frequently you encounter the fragments.

    MATLAB stores data in column-major order, and from the comments it seems that you're interested in performing column-wise operations (sums, averages) on the dataset. If the columns become non-contiguous on disk, you're going to hit lots of fragments on every such operation!
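
    As a rough sketch of what that column-wise processing might look like with matfile (the file name, variable name, and block size are assumptions), note that each block is read with a single indexed operation, which only becomes one long sequential read if the underlying clusters are contiguous:

        % Column-wise means computed one block of columns at a time.
        % MATLAB stores A in column-major order, so each block of columns occupies a
        % contiguous byte range in the file - unless that range has been fragmented.
        m = matfile('bigdata.mat');
        [nRows, nCols] = size(m, 'A');

        colMeans = zeros(1, nCols);
        blockSz  = 100;
        for c = 1:blockSz:nCols
            cols = c:min(c + blockSz - 1, nCols);
            colMeans(cols) = mean(m.A(1:nRows, cols), 1);   % ideally one sequential read per block
        end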

    As mentioned in the comments, both read and write actions are performed via a buffer. As @user3666197 points out, the OS can speculatively read ahead of the current position on disk, on the basis that you're likely to want that data next. This behaviour is especially useful when the hard disk would otherwise sit idle at times - keeping it operating at maximum capacity and working with small parts of the data in buffer memory can greatly improve read and write performance. However, from your question it sounds as though you want to perform large operations on a huge (too big for memory) .mat file. Given your use-case, the hard disk is going to be working at capacity anyway, and the data file is too big to fit in the buffer - so these particular tricks won't solve your problem.

    So... yes, you should pre-allocate. Yes, a performance hit from growing the array on disk will apply. Yes, it will probably be significant (it depends on specifics such as the amount of growth and the degree of fragmentation). And if you're really going to get into the HPC spirit of things, then stop what you're doing, throw away MATLAB, shard your data and try something like Apache Spark! But that's another story.
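
    For completeness, one commonly used way to pre-allocate an on-disk variable with matfile is to assign to its final element first, so the full (zero-filled) extent is reserved before the real writes begin. The names and sizes below are assumptions, and whether the reserved extent ends up physically contiguous still depends on the filesystem and the free space available:

        % Reserve the full array on disk up front by writing its last element.
        m = matfile('bigdata.mat', 'Writable', true);
        nRows = 50000;  nCols = 1000;        % assumed final size
        m.A(nRows, nCols) = 0;               % creates A at full size, zero-filled, in one go

        % Subsequent block writes now fill space that has already been allocated,
        % rather than repeatedly extending the variable (and possibly the file).
        m.A(1:nRows, 1:100) = rand(nRows, 100);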

    Does that answer your question?

    P.S. Corrections / amendments welcome! I was brought up on POSIX inodes, so sincere apologies if there are any inaccuracies in here...