Why does reading from an HDF5 file in R throw "H5Identifier not valid"?


I have downloaded the human HDF5 file (28 GB) from the ARCHS4 RNA-seq data. I want to access the expression data and the group information. I am using the code below:

h5_exprs <- h5read("archs4_gene_human_v2.1.2.h5", "data/expression")

It throws

Error (scratch_11.R#9): Error in h5checktype(). H5Identifier not valid.

Is there an extra step I need to take to solve this?

When I run h5ls("archs4_gene_human_v2.1.2.h5"), the output looks like this:

           group                  name       otype  dclass            dim
0              /                  data   H5I_GROUP                       
1          /data            expression H5I_DATASET INTEGER 620825 x 62548
2              /                  meta   H5I_GROUP                       
3          /meta                 genes   H5I_GROUP                       
4    /meta/genes           gene_symbol H5I_DATASET  STRING          62548
5          /meta               samples   H5I_GROUP                       
6  /meta/samples         aligned_reads H5I_DATASET INTEGER         620825
7  /meta/samples         channel_count H5I_DATASET  STRING         620825
8  /meta/samples   characteristics_ch1 H5I_DATASET  STRING         620825
9  /meta/samples       contact_address H5I_DATASET  STRING         620825
10 /meta/samples          contact_city H5I_DATASET  STRING         620825
11 /meta/samples       contact_country H5I_DATASET  STRING         620825
12 /meta/samples     contact_institute H5I_DATASET  STRING         620825
13 /meta/samples          contact_name H5I_DATASET  STRING         620825
14 /meta/samples           contact_zip H5I_DATASET  STRING         620825
15 /meta/samples       data_processing H5I_DATASET  STRING         620825
16 /meta/samples  extract_protocol_ch1 H5I_DATASET  STRING         620825
17 /meta/samples         geo_accession H5I_DATASET  STRING         620825
18 /meta/samples      instrument_model H5I_DATASET  STRING         620825
19 /meta/samples      last_update_date H5I_DATASET  STRING         620825
20 /meta/samples     library_selection H5I_DATASET  STRING         620825
21 /meta/samples        library_source H5I_DATASET  STRING         620825
22 /meta/samples      library_strategy H5I_DATASET  STRING         620825
23 /meta/samples          molecule_ch1 H5I_DATASET  STRING         620825
24 /meta/samples          organism_ch1 H5I_DATASET  STRING         620825
25 /meta/samples           platform_id H5I_DATASET  STRING         620825
26 /meta/samples              relation H5I_DATASET  STRING         620825
27 /meta/samples             series_id H5I_DATASET  STRING         620825
28 /meta/samples singlecellprobability H5I_DATASET   FLOAT         620825
29 /meta/samples       source_name_ch1 H5I_DATASET  STRING         620825
30 /meta/samples                sra_id H5I_DATASET  STRING         620825
31 /meta/samples                status H5I_DATASET  STRING         620825
32 /meta/samples       submission_date H5I_DATASET  STRING         620825
33 /meta/samples             taxid_ch1 H5I_DATASET  STRING         620825
34 /meta/samples                 title H5I_DATASET  STRING         620825
35 /meta/samples                  type H5I_DATASET  STRING         620825

Solution

  • I'm not sure of the cause of this error. I haven't downloaded the whole 28 GB file, but I am able to read subsets of the /data/expression dataset directly from the S3 storage, e.g.

    library(rhdf5)
    
    h5file <- 'https://s3.dev.maayanlab.cloud/archs4/archs4_gene_human_v2.1.2.h5'
    
    h5read(file = h5file, 
           name = "/data/expression", 
           index = list(1:10, 1:12),
           s3 = TRUE)
    
    #>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
    #>  [1,]  353 1110    3    0    0   51    0    2    0     0   467     0
    #>  [2,]  342  873    2    0    1   33    1    5    0     0   388     0
    #>  [3,]  358 1171    1    0    0   41    0    5    0     0   391     0
    #>  [4,]  393  849    1    0    0   40    0    0    0     0   148     0
    #>  [5,]  427  821    0    0    0   30    0    0    0     0   112     0
    #>  [6,]  293  613    1    0    0   22    3    3    0     0   112     0
    #>  [7,]    0    0    0    1    0    0    0    0    0     0     0     0
    #>  [8,]    0    0    0    3    0    0    0    0    0     0     0     0
    #>  [9,]    1    0    0    5    0    0    0    0    0     0     0     0
    #> [10,]    0    0    0    3    0    0    0    0    0     0     0     0
    

    A few thoughts:

    • I presume that the h5read() command you've indicated is really what's found on line 9 of scratch_11.R?
    • You can try running h5errorHandling(type = "verbose") before running h5read(), which will give a larger HDF5 error stack and might help narrow down the issue.
    • Is it possible that the size of the data is a problem? Reading the whole dataset as 4-byte integers will require ~150 GB of RAM, although I'd expect R to produce a "cannot allocate vector of size ..." error if that was the issue.
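
To make the last two points concrete, here is a rough sketch rather than a verified fix. It estimates the memory a full read would need from the dimensions shown in the h5ls() output, then, only if rhdf5 is installed and the file is present, switches on verbose error handling and reads a small block instead of the whole dataset. The local path is simply the file name from the question and is assumed to sit in the working directory; I have not run the rhdf5 part against the full file.

```r
# Dimensions of /data/expression as reported by h5ls() in the question.
# Plain numeric literals are doubles in R, so the product cannot overflow.
n_rows <- 620825
n_cols <- 62548

# rhdf5 returns an INTEGER dataset as R integers, which take 4 bytes each.
bytes_needed <- n_rows * n_cols * 4
gb_needed <- bytes_needed / 1e9
cat(sprintf("A full read needs about %.0f GB of RAM\n", gb_needed))

# Assumed local path: the file name used in the question.
h5file <- "archs4_gene_human_v2.1.2.h5"

# Only touch the file if rhdf5 is available and the file actually exists.
if (requireNamespace("rhdf5", quietly = TRUE) && file.exists(h5file)) {
  # Ask HDF5 for the full error stack, which may help narrow down the cause.
  rhdf5::h5errorHandling(type = "verbose")

  # Read a small block rather than the whole 620825 x 62548 matrix.
  block <- rhdf5::h5read(h5file, "/data/expression",
                         index = list(1:10, 1:12))
  print(dim(block))
}
```

If the small read works locally but the full read does not, that would point towards the size of the request rather than the file itself.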