Tags: numpy, hdf5, h5py, pytables

Multiple Errors During HDF5 to CSV conversion


I have a huge HDF5 file from which I need to extract each dataset into a separate CSV file. The schema is roughly /Genotypes/GroupN/SubGroupN/calls, with N groups and N sub-groups. I created a sample h5 file with the same structure as the main file and tested the code, which worked correctly, but when I apply the code to my main h5 file it encounters various errors. The schema of the HDF5 file:

/Genotypes
    /genotype a
        /genotype a_1 #one subgroup for each genotype group
            /calls #data that I need to extract to csv file
            depth #data
    /genotype b
        /genotype b_1 #one subgroup for each genotype group
            /calls #data
            depth #data
    .
    .
    .
    /genotype n #1500 genotypes are listed as groups
        /genotype n_1
            /calls 
            depth

/Positions
    /allel #data 
    /chromo #data
/Taxa 
    /genotype a
        /genotype a_1
    /genotype b
        /genotype b_1 #one subgroup for each genotype group
    .
    .
    .
    /genotype n #1500 genotypes are listed as groups
        /genotype n_1

/_Data-Types_
    Enum_Boolean
    String_VariableLength
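To double-check the actual group and dataset paths in a file like this, h5py's visititems() can print the whole hierarchy. A minimal sketch; the in-memory demo file and its names below are made up for illustration:

```python
import h5py

def show_item(name, obj):
    # visititems() callback: receives every group/dataset with its in-file path
    kind = 'Group' if isinstance(obj, h5py.Group) else 'Dataset'
    print(kind + ':', '/' + name)

# demo on a small in-memory file (backing_store=False: nothing hits disk);
# open the real file instead, e.g. h5py.File('d:/Path/file.h5', 'r')
with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as h5f:
    h5f.create_dataset('Genotypes/genotype a/genotype a_1/calls',
                       data=[1, 2, 3])
    h5f.visititems(show_item)
```

Running this against the real file shows exactly which paths the extraction code will visit.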

This is the code for creating the sample h5 file:

import h5py
import numpy as np

ngrps = 2
nsgrps = 3
nds = 4
nrows = 10
ncols = 2

i_arr_dtype = ([('col1', int), ('col2', int)])
with h5py.File('d:/Path/sample_file.h5', 'w') as h5w:
    for gcnt in range(ngrps):
        grp1 = h5w.create_group('Group_' + str(gcnt))
        for scnt in range(nsgrps):
            grp2 = grp1.create_group('SubGroup_' + str(scnt))
            for dcnt in range(nds):
                i_arr = np.random.randint(1, 100, (nrows, ncols))
                ds = grp2.create_dataset('calls_' + str(dcnt), data=i_arr)

I used h5py with a visititems() callback as below:

import h5py
import numpy as np

def dump_calls2csv(name, node):
    if isinstance(node, h5py.Dataset) and 'calls' in node.name:
        print('visiting object:', node.name, ', exporting data to CSV')
        csvfname = node.name[1:].replace('/', '_') + '.csv'
        arr = node[:]
        np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')

##########################    

with h5py.File('d:/Path/sample_file.h5', 'r') as h5r :        
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!

I also used PyTables as below:

import tables as tb
import numpy as np

with tb.File('sample_file.h5', 'r') as h5r:
    for node in h5r.walk_nodes('/', classname='Leaf'):
        print('visiting object:', node._v_pathname, 'export data to CSV')
        csvfname = node._v_pathname[1:].replace('/', '_') + '.csv'
        np.savetxt(csvfname, node.read(), fmt='%5d', delimiter=',')

but each method fails. The error for the h5py version is:

 C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\python.exe C:\Users\...\PycharmProjects\DLLearn\datapreparation.py
visiting object: /Genotypes/Genotype a/genotye a_1/calls , exporting data to CSV
.
.
.
some of the datasets
.
.
.
Traceback (most recent call last):
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 31, in <module>
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\h5py\_hl\group.py", line 565, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5o.pyx", line 355, in h5py.h5o.visit
  File "h5py\defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
  File "h5py\h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\h5py\_hl\group.py", line 564, in proxy
    return func(name, self[name])
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 10, in dump_calls2csv
    np.savetxt(csv_name, arr, fmt='%5d', delimiter=',')
  File "<__array_function__ internals>", line 6, in savetxt
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1377, in savetxt
    open(fname, 'wt').close()
OSError: [Errno 22] Invalid argument: 'Genotypes_Genotype_Name-Genotype_Name2_calls.csv'

Process finished with exit code 1

and the error for the PyTables version is:

C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\python.exe C:\Users\...\PycharmProjects\DLLearn\datapreparation.py
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\tables\attributeset.py:308: DataTypeWarning: Unsupported type for attribute 'locked' in node 'Genotypes'. Offending HDF5 class: 8
  value = self._g_getattr(self._v_node, name)
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\tables\attributeset.py:308: DataTypeWarning: Unsupported type for attribute 'retainRareAlleles' in node 'Genotypes'. Offending HDF5 class: 8
  value = self._g_getattr(self._v_node, name)
visiting object: /Genotypes/AlleleStates export data to CSV
Traceback (most recent call last):
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1447, in savetxt
    v = format % tuple(row) + newline
TypeError: %d format: a number is required, not numpy.bytes_

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 40, in <module>
    np.savetxt(csvfname, node.read(), fmt= '%d', delimiter=',')
  File "<__array_function__ internals>", line 6, in savetxt
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1451, in savetxt
    % (str(X.dtype), format))
TypeError: Mismatch between array dtype ('|S1') and format specifier ('%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d')

Process finished with exit code 1
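I notice the PyTables loop, unlike the h5py version, does not filter for 'calls', so it also visits /Genotypes/AlleleStates, whose byte strings cannot be formatted with fmt='%d'. A filtered sketch; the throwaway demo file and its names below are made up so the snippet runs standalone:

```python
import os
import tempfile

import numpy as np
import tables as tb

# build a tiny throwaway file so this sketch runs standalone;
# point fname at the real HDF5 file instead
tmpdir = tempfile.mkdtemp()
fname = os.path.join(tmpdir, 'sample_file.h5')
with tb.File(fname, 'w') as h5w:
    h5w.create_array('/Genotypes/genotype_a', 'calls',
                     np.array([[1, 2], [3, 4]]), createparents=True)
    h5w.create_array('/Genotypes', 'AlleleStates',
                     np.array([b'A', b'C']))  # byte strings, like the real file

with tb.File(fname, 'r') as h5r:
    for node in h5r.walk_nodes('/', classname='Leaf'):
        # skip non-'calls' leaves such as /Genotypes/AlleleStates,
        # whose byte strings break fmt='%d'
        if 'calls' not in node._v_pathname:
            continue
        csvfname = node._v_pathname[1:].replace('/', '_') + '.csv'
        np.savetxt(os.path.join(tmpdir, csvfname), node.read(),
                   fmt='%d', delimiter=',')
        print('exported:', csvfname)
```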

Can anybody help me with this problem? Please mention the exact changes I need to apply to the code, and provide the complete code. My background is not in coding, so further explanation would be greatly appreciated.


Solution

  • I downloaded the example from your comments; this is a new answer based on my findings. If all calls datasets contain integer data, the fmt='%d' format should work. The only problem I found is invalid characters in the filename created from the group/dataset path: for example, : and ? appear in some group names. I modified dump_calls2csv() to replace : with - and ? with #. Run this and you should get all calls datasets written as CSV files. See the new code below:

    import h5py
    import numpy as np

    def dump_calls2csv(name, node):
        if isinstance(node, h5py.Dataset) and 'calls' in node.name:
            csvfname = node.name[1:] + '.csv'
            csvfname = csvfname.replace('/', '_')  # create csv file name from path
            csvfname = csvfname.replace(':', '-')  # replace invalid character
            csvfname = csvfname.replace('?', '#')  # replace invalid character
            print('export data to CSV:', csvfname)
            np.savetxt(csvfname, node[:], fmt='%d', delimiter=',')

    # open your main file (placeholder path) and visit every object
    with h5py.File('d:/Path/your_file.h5', 'r') as h5r:
        h5r.visititems(dump_calls2csv)
    

    I print csvfname to confirm the character replacements work as expected. It also helps identify the problem dataset if an error occurs on the name.

    Hope that helps. Be patient when you run this: in my test, about half of the CSV files were written after 45 minutes.
    At this point I think the only problem is the characters in the filename, and it is not related to HDF5, h5py, or np.savetxt(). For the general case (with arbitrary group/dataset names), the code should check for any invalid filename characters.
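Generalizing the replacements above, a small helper (the names here are my own) can strip every character that Windows rejects in filenames, not just : and ?:

```python
import re

# characters Windows disallows in file names: \ / : * ? " < > |
_INVALID = re.compile(r'[\\/:*?"<>|]')

def safe_csv_name(h5path):
    # turn an HDF5 path such as /Genotypes/geno a:1/calls into a
    # filesystem-safe CSV file name
    name = h5path.lstrip('/').replace('/', '_')
    return _INVALID.sub('-', name) + '.csv'

print(safe_csv_name('/Genotypes/genotype a:1/calls'))
```

Calling safe_csv_name(node.name) inside the visitor would then cover any group names the real file throws at it.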