This is what my dataframe looks like. The first column is a single int. The second column is a single list of 512 ints.
IndexID Ids
1899317 [0, 47715, 1757, 9, 38994, 230, 12, 241, 12228...
22861131 [0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...
2163410 [0, 26039, 41156, 227, 860, 3320, 6673, 260, 1...
15760716 [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
12244098 [0, 45651, 4128, 227, 5, 10397, 995, 731, 9, 3...
I saved it to hdf and tried opening it using
df.to_hdf('test.h5', key='df', data_columns=True)
h3 = h5py.File('test.h5')
I see 4 keys when I list the keys
h3['df'].keys()
KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values']
Axis1 sees to contain the values for the first column
h3['df']['axis1'][0:5]
array([ 1899317, 22861131, 2163410, 15760716, 12244098,
However, there doesn't seem to be data from the second column. There does is another column with other data
h3['df']['block0_values'][0][0:5]
But that doesn't seem to correspond to any of the data in the second column
array([128, 4, 149, 1, 0], dtype=uint8)
I am eventually trying to create a datastore that's memory mapped, that retrieves data using particular indices.
So something like
h3['df']['workingIndex'][22861131, 15760716]
would retrieve
[0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...],
[0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
The problem is you're trying to serialize a Pandas Series of Python lists and it is not rectangular (it is jagged).
Pandas and HDF5 are largely used for rectangular (cube, hypercube, etc) data, not for jagged lists-of-lists.
Did you see this warning when you call to_hdf()
?
PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['Ids']]
What it's trying to tell you is that lists-of-lists are not supported in an intuitive, high-performance way. And if you run an HDF5 visualization tool like h5dump
on your output file, you'll see what's wrong. The index (which is well-behaved) looks like this:
DATASET "axis1" {
DATATYPE H5T_STD_I64LE
DATASPACE SIMPLE { ( 5 ) / ( 5 ) }
DATA {
(0): 1899317, 22861131, 2163410, 15760716, 12244098
}
ATTRIBUTE "CLASS" {
DATA {
(0): "ARRAY"
}
}
But the values (lists of lists) look like this:
DATASET "block0_values" {
DATATYPE H5T_VLEN { H5T_STD_U8LE}
DATASPACE SIMPLE { ( 1 ) / ( H5S_UNLIMITED ) }
DATA {
(0): (128, 5, 149, 164, ...)
}
ATTRIBUTE "CLASS" {
DATA {
(0): "VLARRAY"
}
}
ATTRIBUTE "PSEUDOATOM" {
DATA {
(0): "object"
}
}
What's happening is exactly what the PerformanceWarning warned you about:
> PyTables will pickle object types that it cannot map directly to c-types
Your list-of-lists is being pickled and stored as H5T_VLEN which is just a blob of bytes.
Here are some ways you could fix this: