Tags: python, arrays, numpy, compression, lossless-compression

Compress numpy arrays efficiently


I have tried various methods to compress numpy arrays when saving them to disk.

These 1D arrays contain data sampled at a fixed rate (sound recorded with a microphone, or any other measurement from any sensor): the underlying signal is essentially continuous (in the mathematical sense; after sampling it is of course discrete).

I tried HDF5 (h5py):

f.create_dataset("myarray1", data=myarray, compression="gzip", compression_opts=9)

but this is quite slow, and the compression ratio is not the best one could hope for.

I also tried

numpy.savez_compressed("myarray1.npz", myarray1=myarray)

but once again the underlying algorithm (DEFLATE, the same family as gzip) may not be the best fit for data like that described above.

What would you choose for a better compression ratio on a numpy array holding this kind of data?

(I thought about lossless FLAC, which was designed for audio; is there an easy way to apply such an algorithm to numpy data?)
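A hedged sketch of that FLAC idea: the third-party soundfile package (a libsndfile wrapper, `pip install soundfile`; not part of numpy) can write a numpy array straight to FLAC. The filename and sample rate below are arbitrary; note FLAC is lossless only for integer samples up to 24 bits:

import numpy as np
import soundfile as sf

data = (np.sin(np.linspace(0, 2000 * np.pi, 100_000)) * 30000).astype(np.int16)
sf.write("data.flac", data, samplerate=44100)      # the rate is just metadata here
restored, rate = sf.read("data.flac", dtype="int16")
assert np.array_equal(data, restored)              # bit-exact round trip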


Solution

    1. Noise is incompressible. Thus, any part of your data that is noise will go into the compressed output 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24 bits per sample with an effective number of bits (ENOB) of 16, the remaining 24 - 16 = 8 bits of noise limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent that it is non-uniform; look at the effective entropy of the noise to determine how compressible it is (see the first sketch after this list).

    2. Compressing data is based on modelling it (partly to remove redundancy, but also partly so you can separate the signal from the noise and discard the noise). For example, if you know your data is bandwidth-limited to 10 MHz and you're sampling at 200 MHz, you can do an FFT, zero out the high frequencies, and store only the coefficients for the low frequencies (in this example: 10:1 compression; see the second sketch after this list). There is a whole related field called "compressive sensing".

    3. A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc.). The denoising step can be the same as the bandwidth limit, or a nonlinear filter like a running median. The bandwidth limit can be implemented with an FIR/IIR filter. Delta compression is just y[n] = x[n] - x[n-1] (see the third sketch after this list).
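    First sketch, for point 1: estimating the entropy of the noise bits. The 24-bit / ENOB-16 figures come from the text above; the gaussian noise here is a stand-in for real sensor noise:

    import numpy as np

    # bottom 8 bits of a 24-bit sample with ENOB = 16 are assumed to be noise
    noise = (np.random.randn(1_000_000) * (1 << 7)).astype(np.int32)
    low_bits = noise & 0xFF                       # keep only the 8 noise bits
    p = np.bincount(low_bits, minlength=256) / low_bits.size
    p = p[p > 0]
    entropy = -(p * np.log2(p)).sum()
    print(f"{entropy:.2f} bits/sample of noise entropy (8.00 = incompressible)")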
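    Second sketch, for point 2: the FFT-truncation idea with the 200 MHz / 10 MHz numbers from above. The test tones are placed on exact FFT bins, so the truncation is genuinely lossless here:

    import numpy as np

    fs, bw, n = 200e6, 10e6, 1 << 16
    t = np.arange(n) / fs
    f1, f2 = 1000 * fs / n, 2300 * fs / n    # ~3.1 MHz and ~7.0 MHz, exact bins
    x = np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t)

    X = np.fft.rfft(x)
    k = int(len(X) * bw / (fs / 2))          # bins covering 0..10 MHz (~10%)
    coeffs = X[:k]                           # store only these: ~10:1 compression

    # reconstruct by zero-padding the discarded high-frequency bins
    x_rec = np.fft.irfft(np.concatenate([coeffs, np.zeros(len(X) - k)]), n)
    print(np.abs(x - x_rec).max())           # float rounding only; nothing lost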
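    Third sketch, for point 3: the whole pipeline, with a running median (via scipy) as the nonlinear denoiser, delta coding, then zlib. The signal is a toy example and the printed sizes are illustrative, not benchmarks:

    import zlib
    import numpy as np
    from scipy.signal import medfilt

    # toy signal: a slow sine wave plus occasional impulsive "glitches"
    x = (np.sin(np.linspace(0, 200 * np.pi, 100_000)) * (1 << 15)).astype(np.int32)
    x[::997] += 1 << 12

    denoised = medfilt(x, kernel_size=5).astype(np.int32)     # running median
    # y[n] = x[n] - x[n-1]; invertible via np.cumsum(delta)
    delta = np.diff(denoised, prepend=0).astype(np.int32)

    print(len(zlib.compress(x.tobytes(), 9)))        # raw signal
    print(len(zlib.compress(delta.tobytes(), 9)))    # typically much smaller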

    EDIT: An illustration:

    import os.path
    import subprocess
    import numpy as np

    # create 1M data points of a 24-bit sine wave with 8 bits of gaussian noise (ENOB=16)
    N = 1000000
    data = (np.sin(2 * np.pi * np.linspace(0, N, N) / 100) * (1 << 23)
            + np.random.randn(N) * (1 << 7)).astype(np.int32)

    np.save('data.npy', data)
    print(os.path.getsize('data.npy'))
    # 4000080 uncompressed size

    subprocess.call('xz -9 data.npy', shell=True)
    print(os.path.getsize('data.npy.xz'))
    # 1484192 compressed size
    # 11.87 bits per sample, ~8 bits of that is noise

    data_quantized = (data // (1 << 8)).astype(np.int32)  # drop the 8 noise bits
    np.save('data_quantized.npy', data_quantized)
    subprocess.call('xz -9 data_quantized.npy', shell=True)
    print(os.path.getsize('data_quantized.npy.xz'))
    # 318380
    # still 16 bits of signal, but it only takes 2.55 bits per sample to store
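
    To read it back, decompress and rescale (a sketch; the 8 discarded bits are gone by design, so this restores the scale, not the noise):

    subprocess.call('xz -d -k data_quantized.npy.xz', shell=True)
    restored = np.load('data_quantized.npy') * (1 << 8)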