
Performance of loading and storing sparse data with numpy


I came up with a custom sparse data format. It's meant for space-efficient storage and loading, not for doing computations on it. The essence is to store the indices and values of the nonzero entries. I was wondering if there are some tweaks that could improve the performance.

The need emerged from handling data like this: N "images" (32x32) with four channels each. The images contain on average ~5% nonzero values. As N grows very large, storing all the images densely in RAM is inefficient. So only the number of nonzero entries per image, their flat indices and values, as well as the original shape, are stored.
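
To put the savings into numbers, here is a rough back-of-the-envelope estimate for a single float64 image (my own approximation, not a measurement):

dense_bytes = 4 * 32 * 32 * 8        # one dense float64 image: 32768 bytes
nnz = int(0.05 * 4 * 32 * 32)        # ~5% nonzero entries: ~204 values
sparse_bytes = nnz * (8 + 8)         # int64 flat index + float64 value per entry
print(dense_bytes / sparse_bytes)    # roughly a factor of 10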

Here's an example of how this can be implemented:

import numpy as np


def disassemble_data(data):
    # get some dense data and make it sparse
    lengths = np.count_nonzero(data, axis=(1, 2, 3))
    idxs = np.flatnonzero(data)
    vals = data.ravel()[idxs]

    return lengths, idxs, vals, data.shape


def assemble_data(lengths, idxs, vals, shape):
    # get some sparse data and make it dense
    data = np.zeros(shape)

    lower_idx = 0
    for length in lengths:
        upper_idx = lower_idx + length
        data.ravel()[idxs[lower_idx:upper_idx]] = vals[lower_idx:upper_idx]
        lower_idx = upper_idx
    return data


# Create some dummy data: zeroing everything above 0.05 leaves ~5% nonzero entries
my_data = np.random.uniform(0, 1, (10, 4, 32, 32))
my_data[my_data > 0.05] = 0

# make data sparse and then dense again
my_new_data = assemble_data(*disassemble_data(my_data))

# assert that this actually works
assert np.allclose(my_data, my_new_data)

Now we can see the advantage directly: the data is densified image by image. This allows us to keep the whole (sparse) dataset in RAM and generate dense images on demand via a generator:

def image_generator(lengths, idxs, vals, shape):
    # reduce global flat indices to per-image flat indices (modifies idxs in place)
    idxs %= np.prod(shape[1:])

    lower_idx = 0
    for length in lengths:
        upper_idx = lower_idx + length
        data = np.zeros(shape[1:])
        data.ravel()[idxs[lower_idx:upper_idx]] = vals[lower_idx:upper_idx]
        lower_idx = upper_idx
        yield data
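
A minimal usage sketch (my own example; note that the generator reduces idxs in place via %=, so pass a copy if the global indices are needed again later):

lengths, idxs, vals, shape = disassemble_data(my_data)

for i, image in enumerate(image_generator(lengths, idxs.copy(), vals, shape)):
    # each yielded image is a dense (4, 32, 32) array
    assert np.allclose(image, my_data[i])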

Furthermore, it is also possible to generate batches of images:

def image_batch_generator(lengths, idxs, vals, shape, batch_size):
    # map global flat indices to positions within a batch (modifies idxs in place);
    # the reshape below requires batch_size to evenly divide the number of images
    idxs %= np.prod((batch_size, *shape[1:]))
    lengths = np.sum(lengths.reshape(-1, batch_size), axis=1)

    lower_idx = 0
    for length in lengths:
        upper_idx = lower_idx + length
        data = np.zeros((batch_size, *shape[1:]))
        data.ravel()[idxs[lower_idx:upper_idx]] = vals[lower_idx:upper_idx]
        lower_idx = upper_idx
        yield data
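
Batch usage works the same way; a sketch, assuming (as noted in the comments above) that batch_size evenly divides the number of images:

for b, batch in enumerate(image_batch_generator(lengths, idxs.copy(), vals, shape, batch_size=5)):
    # each yielded batch is a dense (5, 4, 32, 32) array
    assert np.allclose(batch, my_data[b * 5:(b + 1) * 5])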

This is a rather convenient approach for my needs. Yet I was wondering if it is possible to speed this up.

E.g. I saw that numpy's itemset is faster than direct assignment (according to the docs). But it only works for a single item, not for index arrays.
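
The closest vectorized counterpart for index arrays seems to be np.put. Since the stored indices are global flat indices anyway, the whole reconstruction can also be written as a single assignment; a sketch I haven't benchmarked:

def assemble_data_oneshot(idxs, vals, shape):
    # one fancy assignment over the flattened array replaces the per-image loop;
    # np.put(data, idxs, vals) is equivalent to data.ravel()[idxs] = vals here
    data = np.zeros(shape)
    np.put(data, idxs, vals)
    return data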

Are there any other approaches? I'm not at all familiar with Cython and such, so I'd be happy to get some hints!


Solution

  • I tested a bit how this can be done more efficiently and came to the conclusion that your approach isn't bad at all for highly uncorrelated data like that produced by np.random.uniform. On real data this can look a bit different.

    I improved the speed of your functions a bit, getting about 1.4 GB/s compressing and 1.2 GB/s decompressing, which isn't bad at all. With h5py (blosclz) I could achieve only about 450 MB/s, but that also writes the data to disk.

    Improved sparse algorithm

    import numpy as np
    import numba as nb
    import time
    
    #We can use uint16 on (4,32,32), since max. idx<2**16
    @nb.jit()
    def to_sparse_data_uint16(data):
      data_flat=data.reshape(-1)
    
      idx=np.empty(data.size,dtype=np.uint16)
      data_out=np.empty(data.size,dtype=data.dtype)
    
      ii=0
      for i in range(data_flat.shape[0]):
        if (data_flat[i]!=0):
          idx[ii]=i
          data_out[ii]=data_flat[i]
          ii+=1
    
      return idx[0:ii], data_out[0:ii], data.shape
    
    
    def to_dense_data(idx,data,shape):
      length=np.prod(shape)
      data_out=np.zeros(length,dtype=data.dtype)
    
      data_out[idx]=data
    
      return data_out.reshape(shape)
    
    
    ########################
    #do you really need float64 here?
    images = np.array(np.random.uniform(0, 1, (100000, 4, 32, 32)),dtype=np.float32)
    images[images > 0.05] = 0.
    
    res=[]
    t1=time.time()
    for i in range(100000):
      res.append(to_sparse_data_uint16(images[i,:,:,:]))
    
    print(time.time()-t1)
    
    t1=time.time()
    for i in range(100000):
      data=to_dense_data(res[i][0],res[i][1],res[i][2])
    
    print(time.time()-t1)
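
    A quick round-trip check for the two functions above (my addition, using the first test image):

    #verify that sparsifying and densifying one image is lossless
    idx, vals, shp = to_sparse_data_uint16(images[0])
    assert np.allclose(images[0], to_dense_data(idx, vals, shp))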
    

    HDF5 Example

    import numpy as np
    import tables #register blosc
    import h5py as h5
    import h5py_cache as h5c
    import time
    
    # I assume here that you don't need float64 for images..
    # 1650MB Testdata
    
    images = np.array(np.random.uniform(0, 1, (100000, 4, 32, 32)),dtype=np.float32)
    images[images > 0.05] = 0.
    
    #Write data (32.7 GB uncompressed)
    hdf5_path='Test.h5'
    f = h5c.File(hdf5_path, 'w',chunk_cache_mem_size=1024**2*100) #100 MB chunk cache size
    dset_images = f.create_dataset("images", shape=(20*100000, 4, 32, 32),dtype=np.float32,chunks=(1000, 4, 32, 32),compression=32001,compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
    
    t1=time.time()
    #Don't call h5py too often, this will lead to bad performance
    for i in range(20):
        dset_images[i*100000:(i+1)*100000,:,:,:]=images
    
    f.close()
    print(time.time()-t1)
    
    print("MB/s: " + str(32700/(time.time()-t1)))