A Python program (https://github.com/MannLabs/alphapeptdeep) created the following HDF5 file: https://drive.google.com/file/d/1Ct2B7IU2WsqJfT3eGoR1xn3GSOffFqtN/view?usp=sharing
I can successfully open it in HDFView, view metadata, and even view the data for floating-point fields.
However, for string fields, it simply gives an error (which I suspect is misleading).
For example, if one tries to view the data for this field:
/library/mod_seq_df/sequence
It gives the following (misleading?) error:
failed to read scalar dataset: Filter not available exception: Read Failed
I installed HDFView 3.1.4 in a clean Debian 11 Docker container, and I also installed the HDF5 filter plugins from the HDF5-1.14.0 installation scripts.
Thoughts?
After a little more investigation, I have some good news. I also found lots of challenges in the different HDF5 APIs.
First the good news. I can access the data in your file using h5py (a Python package). So, your HDF5 file appears to be fine. While the problems with HDFView are a headache, your errors are not caused by data corruption (or problems with compression filters).
This is what I have determined:

1. Although the groups are named something_df and have some attributes that look like they "could be" Pandas attributes, this file was not created by Pandas. On closer inspection, several Pandas attributes you would expect to find are missing.
2. There is an attribute named is_pd_dataframe that is saved as an 8-bit Enum (Boolean). Apparently PyTables doesn't support that datatype.
3. The string data is saved as variable-length strings, and PyTables reports "variable length strings are not supported yet". This is consistent with the Pandas error message in my earlier comment, and further confirmation the file probably wasn't created by Pandas.

I included my Python code (below) that extracts some data. (I know you want to work in Java, but this confirms the data is accessible.) At this point I suggest 2 paths: 1) add HDF5/Java tags to your SO question to see if the Java community has an answer, and/or 2) contact The HDF Group about the HDFView problems (you can post a question on their forum at: https://forum.hdfgroup.org/).
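To see for yourself that a string dataset uses variable-length strings (the storage form HDFView and PyTables struggle with), you can inspect its dtype with h5py's string helpers. This is a minimal sketch that builds a small demo file rather than opening your real file; the file name 'vlen_demo.hdf' and dataset name 'sequence' are placeholders, and it assumes h5py >= 2.10 (for string_dtype/check_string_dtype):

```python
import h5py
import numpy as np

# Build a tiny demo file with a variable-length string dataset,
# mimicking the storage of /library/mod_seq_df/sequence.
with h5py.File('vlen_demo.hdf', 'w') as h5f:
    h5f.create_dataset(
        'sequence',
        data=np.array([b'YLQEREQR', b'SMLRWMER'], dtype=object),
        dtype=h5py.string_dtype(encoding='utf-8'),
    )

with h5py.File('vlen_demo.hdf', 'r') as h5f:
    info = h5py.check_string_dtype(h5f['sequence'].dtype)
    # For variable-length strings, info.length is None; a
    # fixed-width string dataset would report an integer length.
    print(info.encoding, info.length)  # utf-8 None
```

Running the same check on your file's string datasets should report length None, confirming they are variable-length.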
Python/h5py solution:
import h5py

with h5py.File('predict.speclib.hdf', 'r') as h5f:
    # read group attributes:
    grp = h5f['/library/mod_seq_df']
    print(f"is_pd_dataframe attribute value: {grp.attrs['is_pd_dataframe']}")
    print(f"last_updated attribute value: {grp.attrs['last_updated']}")
    print()
    # read variable-length string dataset:
    ds = h5f['/library/mod_seq_df/sequence']
    print(ds.shape, ds.dtype)
    for i in range(0, 5):
        print(f'{i}: {ds[i]}')
    for i in range(ds.shape[0] - 5, ds.shape[0]):
        print(f'{i}: {ds[i]}')
    print()
    # read float32 dataset:
    ds = h5f['/library/fragment_intensity_df/y_z1']
    print(ds.shape, ds.dtype)
    for i in range(0, 5):
        print(f'{i}: {ds[i]}')
    for i in range(ds.shape[0] - 5, ds.shape[0]):
        print(f'{i}: {ds[i]}')
Output looks like this:
is_pd_dataframe attribute value: True
last_updated attribute value: Sat Dec 31 16:26:42 2022
(21785,) object
0: b'YLQEREQR'
1: b'SMLRWMER'
2: b'FIQERFER'
3: b'ENFRECLR'
4: b'FLRLCHFK'
21780: b'SGSGNETPLALKSGGGGGGSQTPR'
21781: b'AAPLLAALTALLAAAAAGGDAPPGK'
21782: b'STAVPPVPGPGPGPGPGPGPGSTSR'
21783: b'GDPGDVGGPGPPGASGEPGAPGPPGK'
21784: b'GSIFGSGGGGMSGGGGGAGGGGGGSSHR'
(277994,) float32
0: 0.0
1: 0.0649685338139534
2: 0.012746012769639492
3: 0.036795008927583694
4: 0.12544597685337067
277989: 0.10976477712392807
277990: 0.06583086401224136
277991: 0.0935806930065155
277992: 0.08901204913854599
277993: 0.2575165033340454
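Note that the sequences come back as bytes objects (b'YLQEREQR'). If you want Python str values, h5py 3.x can decode them for you via Dataset.asstr(). A short sketch, using a hypothetical stand-in file ('asstr_demo.hdf') so it runs on its own; with the real file you would open 'predict.speclib.hdf' and use /library/mod_seq_df/sequence:

```python
import h5py
import numpy as np

# Create a small stand-in file with a variable-length string dataset.
with h5py.File('asstr_demo.hdf', 'w') as h5f:
    h5f.create_dataset(
        'sequence',
        data=np.array([b'YLQEREQR', b'SMLRWMER'], dtype=object),
        dtype=h5py.string_dtype(),
    )

with h5py.File('asstr_demo.hdf', 'r') as h5f:
    ds = h5f['sequence']
    print(ds[0])          # raw read returns bytes: b'YLQEREQR'
    print(ds.asstr()[0])  # decoded view returns str: YLQEREQR
```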