Is it possible to update dataset dimensions in hdf5 file using rhdf5 in R?


I am trying to update 7 datasets within 1 group in an hdf5 file, but the updated datasets have different sizes than the originals (though the same dimensionality, i.e. 1D, 2D, and 3D). Is there a way to alter the dimension property in order to update the dataset? Alternatively, can I delete the previous group and then create a new group in its place? I'd rather not rebuild the entire h5 file (create file, create groups, create datasets) since it's decently complex.

I am using the Bioconductor rhdf5 package in R.

Example data:

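# install package from Bioconductor and load it
# (biocLite() has been retired; BiocManager is the current installer)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("rhdf5")
library(rhdf5)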

# create new h5 file and populate
created = h5createFile('example.h5')
created = h5createGroup('example.h5', 'foo')
h5write(matrix(1:10, nrow = 5, ncol = 2), 'example.h5', 'foo/A')

# updating dataset with data of same dimension is successful
h5write(matrix(11:20, nrow = 5, ncol = 2), 'example.h5', 'foo/A')

# updating dataset with data of different dimension fails
h5write(matrix(1:12, nrow = 6, ncol = 2), 'example.h5', 'foo/A')

Note: I've read data from hdf5 files in the past, but this is my first time writing data back out into the file, so perhaps this is a naive expectation.


Solution

  • Unfortunately, the maximum size of an HDF5 dataset is fixed when it is created, and can't be increased afterwards. You're going to have to recreate at least the datasets you want to extend.

    HDF5 does allow you to "delete" a dataset, but this only involves unlinking it, i.e. it becomes inaccessible, but the space is not reclaimed. Older rhdf5 releases didn't seem to provide an interface to this; more recent ones document an h5delete() function for it. Someone more familiar with rhdf5 may be able to help you there.
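
The delete-and-recreate route described above could look roughly like the sketch below. It assumes an rhdf5 release that ships h5delete(), and it works on a temporary file (the `path` variable is illustrative) rather than on example.h5. As noted above, unlinking does not shrink the file.

```r
library(rhdf5)

# Build a small file matching the example in the question
path <- tempfile(fileext = ".h5")
h5createFile(path)
h5createGroup(path, "foo")
h5write(matrix(1:10, nrow = 5, ncol = 2), path, "foo/A")

# Unlink the old 5 x 2 dataset, then write a new one under the same name.
# h5write() creates the dataset because it no longer exists.
h5delete(path, "foo/A")
h5write(matrix(1:12, nrow = 6, ncol = 2), path, "foo/A")

new_dim <- dim(h5read(path, "foo/A"))
print(new_dim)
```

Only the datasets being replaced are touched; the rest of the file, including the group itself, stays as it was.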

    You can set the maximum size when you create the dataset in rhdf5, as described in the rhdf5 reference manual:

    h5createDataset('example.h5', 'foo/A', c(10), maxdims=c(12))

    Here the dataset starts with 10 elements and can later grow to at most 12. If you want an unlimited maxdims, it's a bit more involved: you first have to create a dataspace using HDF5 constants and use that to create your dataset.
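
For datasets you expect to grow, a possible approach is to create them with spare room in maxdims and extend them later. The sketch below assumes an rhdf5 release that provides h5set_extent(); the temporary file, the 100-row cap, and the chunk shape are arbitrary choices for illustration.

```r
library(rhdf5)

path <- tempfile(fileext = ".h5")
h5createFile(path)
h5createGroup(path, "foo")

# Start at 5 x 2, but allow the first dimension to grow up to 100 rows.
# Resizable datasets must be chunked, so a chunk shape is given explicitly.
h5createDataset(path, "foo/A", dims = c(5, 2), maxdims = c(100, 2),
                storage.mode = "integer", chunk = c(5, 2))
h5write(matrix(1:10, nrow = 5, ncol = 2), path, "foo/A")

# Grow the dataset to 6 x 2 (within maxdims), then overwrite its contents
h5set_extent(path, "foo/A", c(6, 2))
h5write(matrix(1:12, nrow = 6, ncol = 2), path, "foo/A")

grown_dim <- dim(h5read(path, "foo/A"))
print(grown_dim)
```

This only helps for datasets created with a large enough maxdims in the first place; an existing dataset whose maxdims equals its current size still has to be recreated. Newer rhdf5 documentation also mentions H5Sunlimited() as a value for maxdims, which may make the unlimited case simpler than described above.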