Tags: python, numpy, hdf5, h5py, pytables

How do I store this type of numpy array in HDF5? Each row contains an int and a numpy array of several ints, and the second array's size varies from row to row


My data looks like this:

array([[0, array([ 4928722,  3922609, 14413953, 10103423,  8948498])],
       [1,
        array([12557217,  5572869, 13415223,  2532000, 14609022,  9830632,
        9800679,  7504595, 10752682])],
       [2,
        array([10458710,  7176517, 10268240,  4173086,  8617671,  4674075,
       12580461,  2434641,  3694004,  9734870,  1314108,  8879955,
        6468499, 12092464,  2962425, 13680848, 10590392, 10203584,
       12816205,  7484678,  7985600, 12896218, 14882024,  6783345,
         969850, 10709191,  4541728,  4312270,  6174902,   530425,
        4843145,  4838613, 11404068,  9900162, 10578750, 12955180,
        4602929,  4097386,  8870275,  7518195, 11849786,  2947773,
       11653892,  7599644,  5895991,  1381764,  5853764, 11048535,
       14128229, 11490202,   954680, 11998906,  9196156,  4506953,
        6597761,  7034485,  3008940,  9816877,  1748801, 10159466,
        2745090, 14842579,   788308,  5984365])],
       ...,
       [62711, array([ 6159359,  5003282, 11818909, 11760670])],
       [62712,
        array([ 4363069,  8566447,  9547966, 14554871,  2108131, 12207856,
       14840255, 13087558])],
       [62713,
        array([11252023,  8710787,  4233645, 11415316, 13888594,  7410770,
        2298432,  9330913, 13715351,  8284109,  9142809,  3099529,
       12366159, 10968492, 11123026,  1814941, 11209771, 10860521,
        1798095,  4389487,  4461271, 10070622,  3689125,   880863,
       13672430,  6677251, 10431890,  3447966, 12675925,   729773])]],
      dtype=object)

In each row there is an int, followed by a numpy array of several ints; the size of the second array can vary from 2 to 200 ints.
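
For reference, here is a small stand-in with the same structure, built directly (the values here are made up):

import numpy as np

# Hypothetical miniature of sampleDF: each row pairs an int label with a
# variable-length array of ints, so numpy falls back to dtype=object.
rows = [
    [0, np.array([4928722, 3922609, 14413953])],
    [1, np.array([12557217, 5572869])],
    [2, np.array([10458710, 7176517, 10268240, 4173086, 8617671])],
]
sample = np.array(rows, dtype=object)  # shape (3, 2), dtype('O')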

I am trying to figure out how to save this to HDF5.

I tried this method:

import h5py
h5f = h5py.File('data.h5', 'w')
h5f.create_dataset('dataset_1', data=sampleDF, compression='gzip', compression_opts=9)

But I got this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-6667d439c206> in <module>()
      1 import h5py
      2 h5f = h5py.File('data.h5', 'w')
----> 3 h5f.create_dataset('dataset_1', data=sampleDF, compression='gzip', compression_opts=9)

1 frames
/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    114         """
    115         with phil:
--> 116             dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
    117             dset = dataset.Dataset(dsid)
    118             if name is not None:

/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times)
     98         else:
     99             dtype = numpy.dtype(dtype)
--> 100         tid = h5t.py_create(dtype, logical=1)
    101 
    102     # Legacy

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

This looks like it's due to the varying length of the second arrays, which forces numpy to store the rows with dtype object ('O'), a type that has no native HDF5 equivalent.
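
A quick inspection confirms that both the container and its cells are generic Python objects:

print(sampleDF.dtype)        # dtype('O')
print(type(sampleDF[0, 0]))  # a Python/numpy int
print(type(sampleDF[0, 1]))  # <class 'numpy.ndarray'>, length varies by row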

Is there a way to store this type of data in hdf5?

Here is code to reproduce the issue. It downloads and opens a small chunk of my data. I have also included a Colab notebook so you can run the code without downloading anything to your system.

https://colab.research.google.com/drive/1kaaYk5_xbzQcXTr_DhjuWQT_3S4E-rML

Full code:

import requests
import pickle
import numpy as np
import pandas as pd

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

# Fetch the sample .npy file and load it; allow_pickle is required for object arrays
download_file_from_google_drive('1-V6iSeGFlpiouerNDLYtG3BI4d5ZLMfu', 'sample.npy')
sampleDF = np.load('sample.npy', allow_pickle=True)

import h5py
h5f = h5py.File('data2.h5', 'w')
h5f.create_dataset('dataset_1', data=sampleDF, compression='gzip', compression_opts=9)

As was pointed out in the comments, h5py has a special 'vlen' dtype for handling ragged arrays: http://docs.h5py.org/en/stable/special.html#arbitrary-vlen-data
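
The linked page shows vlen datasets being written one element at a time, roughly like this (a sketch adapted from the docs, not my data):

import h5py
import numpy as np

dt = h5py.special_dtype(vlen=np.dtype('int32'))
with h5py.File('vlen_demo.h5', 'w') as f:
    dset = f.create_dataset('vlen_int', shape=(3,), dtype=dt)
    dset[0] = [1, 2, 3]  # each element can have a different length
    dset[1] = [4, 5]
    dset[2] = [6, 7, 8, 9]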

However, I do not know how to apply it to my 2-D object array. This is my attempt:

h5f = h5py.File('data.h5', 'w')
dt = h5py.special_dtype(vlen=np.dtype('int32'))
h5f.create_dataset('dataset_1', data=sampleDF, dtype=dt, compression='gzip', compression_opts=9)

And this is the result:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: Cannot return member number (operation not supported for type class)
Exception ignored in: 'h5py._proxy.make_reduced_type'
ValueError: Cannot return member number (operation not supported for type class)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-4256da5cbf76> in <module>()
      2 h5f = h5py.File('data2.h5', 'w')
      3 dt = h5py.special_dtype(vlen=np.dtype('int32'))
----> 4 h5f.create_dataset('dataset_1', data=new_array, dtype=dt, compression='gzip', compression_opts=9)

1 frames
/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    114         """
    115         with phil:
--> 116             dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
    117             dset = dataset.Dataset(dsid)
    118             if name is not None:

/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times)
    141 
    142     if (data is not None) and (not isinstance(data, Empty)):
--> 143         dset_id.write(h5s.ALL, h5s.ALL, data)
    144 
    145     return dset_id

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5d.pyx in h5py.h5d.DatasetID.write()

h5py/_proxy.pyx in h5py._proxy.dset_rw()

h5py/_proxy.pyx in h5py._proxy.needs_proxy()

ValueError: Not a datatype (not a datatype)

Solution

  • As @kcw78 pointed out, store the two columns as separate datasets: the vlen dtype applies to a one-dimensional dataset whose elements are variable-length sequences, so the ragged column cannot be mixed with the plain int column.

    To store

    h5f = h5py.File('data.h5', 'w')
    dt = h5py.special_dtype(vlen=np.dtype('int32'))
    h5f.create_dataset('batch', data=sampleDF[:,1], dtype=dt, compression='gzip', compression_opts=9)
    # labels are fixed-size ints, so no vlen dtype is needed here
    h5f.create_dataset('labels', data=sampleDF[:,0].astype(np.int32), compression='gzip', compression_opts=9)
    h5f.close()
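
    Note: on h5py 2.10 and later, the same vlen dtype can be spelled with the newer, clearer API (equivalent to special_dtype):

    dt = h5py.vlen_dtype(np.dtype('int32'))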
    

    To open

    h5f2 = h5py.File('data.h5', 'r')
    resurrectedDF = np.column_stack((h5f2['labels'][:], h5f2['batch'][:]))
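
    A quick round-trip check (reusing the names above) confirms the data survives; note the arrays come back as int32:

    assert np.array_equal(resurrectedDF[:, 0].astype(np.int32),
                          sampleDF[:, 0].astype(np.int32))
    assert all(np.array_equal(a, b)
               for a, b in zip(resurrectedDF[:, 1], sampleDF[:, 1]))
    h5f2.close()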