Search code examples
pythonhdf5astropyfitsdata-cube

Creating a HDF5 datacube in Python with multiple FITS files


I am currently struggling with a problem. I have 1500 fits files that contain 3800 x 3800 arrays. My objective is to create a single HDF5 datacube with them. Unfortunately I cannot provide with the fits files (due to storage problems). So far, I have been able to create an empty HDF5 array with the required shape (3800, 3800, 1500) by doing:

import h5py    
from astropy.io import fits
import glob

outname = "test.hdf5"
NAXIS1 = 3800
NAXIS2 = 3800
NAXIS3 = 1500
f = h5py.File(outname,'w')
dataset = f.create_dataset("DataCube",
                           (NAXIS1,NAXIS2,NAXIS3),
                           dtype=np.float32)
f.close()

but I am having trouble trying to write the arrays from the fits files, because each element from the following for loop takes 30 mins at least:

f = h5py.File(outname,'r+')
# This is the actual line, but I will replace it by random noise
# in order to make the example reproducible.
# fitslist = glob.glob("*fits") # They are 1500 fits files

for i in range(NAXIS3): 
    # Again, I replace the real data with noise.
    # hdul = fits.open(fitslist[i])
    # file['DataCube'][:,:,i] = hdul[0].data     
    data = np.random.normal(0,1,(dim0,dim1))    
    file['DataCube'][:,:,i] = data
f.close()

There is any better way to construct a 3D datacube made of N slices which are already stored in N fits files? I was expecting that once the HDF5 file was created in the disc, writing it would be quite fast, but it wasn't.

Thank you very much for your help.

EDIT 1: I tested the modification proposed by astrofrog and it worked really well. Now the performance is quite good. In addition to this, I stored several fits files (~50) into one temporary numpy array in order to reduce the number of times I write into the hdf5 file. Now the code looks like this:

NAXIS1 = len(fitslist)
NAXIS2 = fits_0[ext].header['NAXIS1']
NAXIS3 = fits_0[ext].header['NAXIS2']
shape_array = (NAXIS2, NAXIS3)
print(shape_array)

f = h5py_cache.File(outname, 'w', chunk_cache_mem_size=3000*1024**2,
                    libver='latest')

dataset = f.create_dataset("/x", (NAXIS1, NAXIS2, NAXIS3),
                           dtype=np.float32)

cache_size = 50
cache_array = np.empty(shape=(cache_size, NAXIS2, NAXIS3))
j = 0
for i in tqdm(range(len(fitslist))):
    print(fitslist[i])
    hdul = fits.getdata(fitslist[i], ext)
    cache_array[j:j+1, :, :] = hdul

    if ((i % cache_size == 0) & (i != 0)):
        print("Writing to disc")
        f['/x'][i-cache_size+1:i+1, :, :] = cache_array
        j = 0
    if (i % 100 == 0):
        print("collecting garbage")
        gc.collect()
    j = j + 1
f.close()

My question is: There is any more pythonic way of doing this? I am not sure is this is the most efficient way of writing files with h5py, or if there is any better way to read from fits to numpy and then to hdf5.


Solution

  • I think the issue might be that the order of the dimensions should be NAXIS3, NAXIS2, NAXIS1 (currently I think it is doing very inefficient striding over the array). I would also only add the array to the HDF5 file at the end:

    import glob
    
    import h5py
    import numpy as np
    from astropy.io import fits
    
    fitslist = glob.glob("*.fits")
    
    NAXIS1 = 3800
    NAXIS2 = 3800
    NAXIS3 = 1000
    
    array = np.zeros((NAXIS3, NAXIS2, NAXIS1), dtype=np.float32)
    
    for i in range(NAXIS3):
        hdul = fits.open(fitslist[i], memmap=False)
        array[i, :, :] = hdul[0].data
    
    f = h5py.File('test.hdf5', 'w')
    f.create_dataset("DataCube", data=array)
    f.close()
    

    If you need the array in the NAXIS1, NAXIS2, NAXIS3 order, just transpose it at the very end.