python hdf5 h5py

How to access images in a dataset in hdf5 format?


I downloaded a hyperspectral dataset called "dataset.hdf5" from http://microbia.org/index.php/resources and am trying to explore the data inside:

import numpy as np
import h5py

# Open the file read-only and list the top-level datasets
hf = h5py.File("dataset.hdf5", "r")
hf.keys()

Output:

<KeysViewHDF5 ['CSSs', 'IMGs', 'SEGMs', 'agarFootprint', 'circularity', 'convexity', 'hemolysis', 'inertia', 'labels', 'labelsPathogens', 'positions', 'sizes', 'waves']>

dataset_IMGs = hf['IMGs']
dataset_IMGs[:]

Output:

array([b'IMG_WLATRIO_51145900_T1080_TW0H1S1',
       b'IMG_WLATRIO_51145900_T1080_TW0H1S1',
       b'IMG_WLATRIO_51145900_T1080_TW0H1S1', ...,
       b'IMG_WLATRIO_51144600_T1080_TW0H1S1',
       b'IMG_WLATRIO_51144600_T1080_TW0H1S1',
       b'IMG_WLATRIO_51144600_T1080_TW0H1S1'], dtype='|S35') 

My goal is to extract those images in their original format, but what I see above looks like some kind of binary encoding. I searched for and tried several scripts, but none of them extracted the images.

Does anyone know what these values are and how to extract the images?


Solution

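  • I agree with @jacub: this file doesn't appear to contain any image data. The b'...' values in your output are fixed-length byte strings (dtype S35), not encoded pixels, so IMGs is an array of file names. I used ptdump (a utility that ships with PyTables) to get a summary of the datasets and their contents. This is what I found: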

    C:\Users\walker\Downloads>ptdump dataset.hdf5
    / (RootGroup) ''
    /CSSs (Array(10398, 125)) ''
    /IMGs (Array(10398,)) ''
    /SEGMs (Array(10398,)) ''
    /agarFootprint (Array(10398, 125)) ''
    /circularity (Array(10398,)) ''
    /convexity (Array(10398,)) ''
    /hemolysis (Array(10398,)) ''
    /inertia (Array(10398,)) ''
    /labels (Array(10398,)) ''
    /labelsPathogens (Array(10398,)) ''
    /positions (Array(10398, 2)) ''
    /sizes (Array(10398,)) ''
    /waves (Array(125,)) ''
    

    The link has this comment about the file: "The hyperspectral database contains a selected collection of spectral signatures from bacteria colonies on solid blood agar plates. ... The database has the aim to offer a first benchmark to assess image analysis algorithms performances for this application."
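
If PyTables is not installed, a similar summary can be produced with h5py alone. This is a minimal sketch, not the tool used above: summarize is a hypothetical helper name, and it relies only on h5py's visititems to walk the file and report each dataset's shape and dtype.

```python
import sys

import h5py


def summarize(path):
    """Return a list of (name, shape, dtype) for every dataset in an HDF5 file."""
    rows = []

    def visit(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            rows.append((name, obj.shape, str(obj.dtype)))

    with h5py.File(path, "r") as hf:
        hf.visititems(visit)
    return rows


if __name__ == "__main__":
    # Pass one or more paths on the command line, e.g. dataset.hdf5
    for path in sys.argv[1:]:
        for name, shape, dtype in summarize(path):
            print(f"/{name} {shape} {dtype}")
```

A byte-string dtype such as |S35 on IMGs, next to numeric dtypes on the other datasets, is a quick sign that the entry holds names rather than pixel data.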

    You can get raw image data from the links under the heading "MicrobIA Images Dataset (Beta ver. 0.1)". MicrobIA_Dataset...sample.zip has 20 images in 4 folders; I'd start there. The other datasets seem to require an account/login, which I don't have.
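
To make the IMGs entries easier to work with, the byte strings can be decoded to ordinary Python strings and grouped by image name. The sketch below is illustrative only: decode_names and rows_per_image are hypothetical helpers, and the sample array reuses values printed in the question instead of reading the real file.

```python
import numpy as np


def decode_names(raw):
    """Convert an array of fixed-length byte strings (e.g. dtype 'S35') to str."""
    return np.char.decode(np.asarray(raw), "ascii")


def rows_per_image(names):
    """Map each distinct image name to the row indices that reference it."""
    index = {}
    for row, name in enumerate(names):
        index.setdefault(str(name), []).append(row)
    return index


# Stand-in data: three values copied from the output shown in the question.
sample = np.array(
    [
        b"IMG_WLATRIO_51145900_T1080_TW0H1S1",
        b"IMG_WLATRIO_51145900_T1080_TW0H1S1",
        b"IMG_WLATRIO_51144600_T1080_TW0H1S1",
    ],
    dtype="S35",
)
names = decode_names(sample)
print(rows_per_image(names))
```

With the real file, the same helpers would be called as decode_names(hf["IMGs"][:]). Since IMGs and CSSs both have 10398 rows, row i of CSSs is presumably the spectral signature taken from the image named in IMGs[i], but that pairing is an assumption based on the shapes, not something the file summary confirms.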