I thought about a custom sparse data format. It's meant for space efficient storage and loading, not to do computations on it. The essence is to store the indices and values of nonzero entries. I was wondering if there are some tweaks that could improve the performance.
The need emerged from handling data like this: N "images" (32x32) with four channels each. The images contain on average ~5% nonzero values. As N grows very large, storing all the images in RAM is inefficient. So only the number of nonzero entries, their indices and values as well as the original shape is stored.
Here's a example how this can be implemented:
import numpy as np
def disassemble_data(data):
# get some dense data and make it sparse
lengths = np.count_nonzero(data, axis=(1, 2, 3))
idxs = np.flatnonzero(data)
vals = data.ravel()[idxs]
return lengths, idxs, vals, data.shape
def assemble_data(lengths, idxs, vals, shape):
# get some sparse data and make it dense
data = np.zeros(shape)
lower_idx = 0
for length in lengths:
upper_idx = lower_idx + length
data.ravel()[idxs[lower_idx:upper_idx]] = vals[lower_idx:upper_idx]
lower_idx = upper_idx
return data
# Create some dummy data
my_data = np.random.uniform(0, 1, (10, 4, 32, 32))
my_data[my_data > 0.05] = 0
# make data sparse and then dense again
my_new_data = assemble_data(*disassemble_data(my_data))
# assert that this actually works
assert np.allclose(my_data, my_new_data)
Now, we can directly see the advantage: The data is densified image by image. This allows us to load the whole dataset into RAM and generate dense images on demand via a generator:
def image_generator(lengths, idxs, vals, shape):
idxs %= np.prod(shape[1:])
lower_idx = 0
for length in lengths:
upper_idx = lower_idx + length
data = np.zeros(shape[1:])
data.ravel()[idxs[lower_idx:upper_idx]] = vals[lower_idx:upper_idx]
lower_idx = upper_idx
yield data
Further it is also possible to generate batches of images:
def image_batch_generator(lengths, idxs, vals, shape, batch_size):
idxs %= np.prod((batch_size, *shape[1:]))
lengths = np.sum(lengths.reshape(-1, batch_size), axis=1)
lower_idx = 0
for length in lengths:
upper_idx = lower_idx + length
data = np.zeros((batch_size, *shape[1:]))
data.ravel()[idxs[lower_idx:upper_idx]] = vals[lower_idx:upper_idx]
lower_idx = upper_idx
yield data
This is a rather convenient approach for my needs. Yet I was wondering if it is possible to speed this up.
E.g. I saw that numpys itemset is faster than direct assignment (according to the docs). But it only works for a single item, not for index arrays.
Are there any other approaches? I'm not at all familiar with cython and such, so I'd be happy to get some hints!
I tested a bit how this can be done more efficiently and came to the point that your approach isn't bad at all for highly uncorrelated data produced by np.random.uniform. On real data this can be a bit different.
I improved the speed of your functions a bit giving about 1.4GB/s compressing and 1.2GB/s decompressing, which isn't bad at all. With h5py (blosclz) I could achieve only about 450MB/s, but writing the data also to disk.
Improved sparse algorithm
import numpy as np
import numba as nb
#We can use uint16 on (4,32,32), since max. idx<2**16
@nb.jit()
def to_sparse_data_uint16(data):
data_flat=data.reshape(-1)
idx=np.empty(data.size,dtype=np.uint16)
data_out=np.empty(data.size,dtype=data.dtype)
ii=0
for i in range(data_flat.shape[0]):
if (data_flat[i]!=0):
idx[ii]=i
data_out[ii]=data_flat[i]
ii+=1
return idx[0:ii], data_out[0:ii], data.shape
def to_dense_data(idx,data,shape):
length=np.prod(shape)
data_out=np.zeros(length,dtype=data.dtype)
data_out[idx]=data
return data_out.reshape(shape)
########################
#do you really need float64 here?
images = np.array(np.random.uniform(0, 1, (100000, 4, 32, 32)),dtype=np.float32)
images[images > 0.05] = 0.
res=[]
t1=time.time()
for i in range(100000):
res.append(to_sparse_data_uint16(images[i,:,:,:]))
print(time.time()-t1)
t1=time.time()
for i in range(100000):
data=to_dense_data(res[i][0],res[i][1],res[i][2])
print(time.time()-t1)
HDF5 Example
import numpy as np
import tables #register blosc
import h5py as h5
import h5py_cache as h5c
import time
# I assume here that you don't need float64 for images..
# 1650MB Testdata
images = np.array(np.random.uniform(0, 1, (100000, 4, 32, 32)),dtype=np.float32)
images[images > 0.05] = 0.
#Write data (32,7 GB uncompressed)
hdf5_path='Test.h5'
f = h5c.File(hdf5_path, 'w',chunk_cache_mem_size=1024**2*100) #200 MB cache size
dset_images = f.create_dataset("images", shape=(20*100000, 4, 32, 32),dtype=np.float32,chunks=(1000, 4, 32, 32),compression=32001,compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
t1=time.time()
#Don't call h5py to often, this will lead to bad performance
for i in range(20):
dset_images[i*100000:(i+1)*100000,:,:,:]=images
f.close()
print(time.time()-t1)
print("MB/s: " + str(32700/(time.time()-t1)))