python numpy serialization archive npz-file

What is the advantage of saving `.npz` files instead of `.npy` in python, regarding speed, memory and look-up?


The NumPy documentation for numpy.savez, which saves an .npz file, says:

The .npz file format is a zipped archive of files named after the variables they contain. The archive is not compressed and each file in the archive contains one variable in .npy format. [...]

When opening the saved .npz file with load a NpzFile object is returned. This is a dictionary-like object which can be queried for its list of arrays (with the .files attribute), and for the arrays themselves.

My question is: what is the point of numpy.savez?

Is it just a more elegant version (shorter command) to save multiple arrays, or is there a speed-up in the saving/reading process? Does it occupy less memory?


Solution

  • The answer has two parts.

    I. NPY vs. NPZ

    According to the docs, the .npy format is:

    the standard binary file format in NumPy for persisting a single arbitrary NumPy array on disk. ... The format is designed to be as simple as possible while achieving its limited goals. (sources)

    And .npz is only a

    simple way to combine multiple arrays into a single file, one can use ZipFile to contain multiple “.npy” files. We recommend using the file extension “.npz” for these archives. (sources)

    So, .npz is just a ZipFile containing multiple “.npy” files. And this ZipFile can be either compressed (by using np.savez_compressed) or uncompressed (by using np.savez).

    This is similar to a tarball on Unix-like systems: a tarball can be just an uncompressed archive containing other files, or it can be a compressed archive when combined with a compression program (gzip, bzip2, etc.).
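
    To make the "zip of .npy files" point concrete, here is a minimal sketch that writes an .npz into an in-memory buffer and then inspects it with the standard zipfile module. The array names a and b are arbitrary, chosen only for illustration.

```python
import io
import zipfile

import numpy as np

# Hypothetical example arrays; the keyword names "a" and "b" are arbitrary.
a = np.arange(5)
b = np.ones((2, 3))

# Write an uncompressed .npz archive into an in-memory buffer.
buf = io.BytesIO()
np.savez(buf, a=a, b=b)
buf.seek(0)

# Open the very same bytes as an ordinary zip archive.
with zipfile.ZipFile(buf) as zf:
    member_names = sorted(zf.namelist())
    compress_types = {info.filename: info.compress_type for info in zf.infolist()}
    # Each member is a plain .npy file that np.load can read on its own.
    with zf.open("a.npy") as member:
        a_roundtrip = np.load(io.BytesIO(member.read()))

print(member_names)
print(compress_types)
```

    The members are named after the keyword arguments with ".npy" appended, and with np.savez they are stored without compression.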

    II. Different APIs for binary serialization

    NumPy provides several APIs for producing and reading these binary files:

    • np.save ---> Save an array to a binary file in NumPy .npy format
    • np.savez --> Save several arrays into a single file in uncompressed .npz format
    • np.savez_compressed --> Save several arrays into a single file in compressed .npz format
    • np.load --> Load arrays or pickled objects from .npy, .npz or pickled files
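
    As a rough illustration of how these four functions fit together, the sketch below saves two made-up arrays each way and loads them back. The file names and the arrays x and y are assumptions for the example only.

```python
import os
import tempfile

import numpy as np

# Hypothetical example data.
x = np.arange(10)
y = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "x.npy")
    npz_path = os.path.join(tmp, "xy.npz")
    npz_c_path = os.path.join(tmp, "xy_compressed.npz")

    np.save(npy_path, x)                       # one array -> .npy
    np.savez(npz_path, x=x, y=y)               # several arrays -> uncompressed .npz
    np.savez_compressed(npz_c_path, x=x, y=y)  # several arrays -> compressed .npz

    # np.load returns an ndarray for .npy files ...
    x_loaded = np.load(npy_path)

    # ... and a dictionary-like NpzFile for .npz files.
    with np.load(npz_path) as data:
        keys = sorted(data.files)
        y_loaded = data["y"]

    # Compressed archives are read with the same call.
    with np.load(npz_c_path) as data_c:
        x_from_compressed = data_c["x"]

print(keys)
```

    The caller does not need to know whether an .npz was written with np.savez or np.savez_compressed; np.load handles both.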

    A skim of the NumPy source code shows what happens under the hood:

    def _savez(file, args, kwds, compress, allow_pickle=True, pickle_kwargs=None):
        ...
        if compress:
            compression = zipfile.ZIP_DEFLATED
        else:
            compression = zipfile.ZIP_STORED
        ...
    
    
    def savez(file, *args, **kwds):
        _savez(file, args, kwds, False)
    
    
    def savez_compressed(file, *args, **kwds):
        _savez(file, args, kwds, True)
    

    Then back to the question:

    • With np.savez, no compression is applied on top of the .npy format. The result is just a single archive file, which makes it convenient to manage multiple related arrays.
    • With np.savez_compressed, the file takes less space on disk, at the cost of extra CPU time for the compression (i.e. saving is a bit slower).
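
    The size difference can be checked with a small sketch like the one below. It assumes a deliberately compressible array (all zeros); the helper npz_size is hypothetical and not part of NumPy. Data that is close to random will shrink far less.

```python
import io

import numpy as np

# An 8 MB float64 array of zeros: an intentionally favourable case for DEFLATE.
arr = np.zeros((1000, 1000))


def npz_size(save_fn, **arrays):
    """Hypothetical helper: size in bytes of the archive written by save_fn."""
    buf = io.BytesIO()
    save_fn(buf, **arrays)
    return buf.getbuffer().nbytes


size_stored = npz_size(np.savez, arr=arr)
size_deflated = npz_size(np.savez_compressed, arr=arr)

print("np.savez:           ", size_stored, "bytes")
print("np.savez_compressed:", size_deflated, "bytes")
```

    The uncompressed archive is slightly larger than the raw array data because of the .npy header and the zip bookkeeping, while the compressed one is much smaller for this input.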