I have a huge HDF5 file from which I need to extract each dataset into a separate CSV file. The schema is roughly /Genotypes/GroupN/SubGroupN/calls, with N groups and N sub-groups. I created a sample h5 file with the same structure as the main file and tested my code, which worked correctly, but when I apply it to the main h5 file it raises various errors. The schema of the HDF5 file:
/Genotypes
    /genotype a
        /genotype a_1        # one subgroup for each genotype group
            /calls           # data that I need to extract to a csv file
            depth            # data
    /genotype b
        /genotype b_1        # one subgroup for each genotype group
            /calls           # data
            depth            # data
    .
    .
    .
    /genotype n              # 1500 genotypes are listed as groups
        /genotype n_1
            /calls
            depth
/Positions
    /allel                   # data
    chromo                   # data
/Taxa
    /genotype a
        /genotype a_1
    /genotype b
        /genotype b_1        # one subgroup for each genotype group
    .
    .
    .
    /genotype n              # 1500 genotypes are listed as groups
        /genotype n_1
/_Data-Types_
    Enum_Boolean
    String_VariableLength
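To confirm the actual group/dataset names and dtypes before exporting, a small inspection pass with visititems can help. This is only a sketch: it builds a tiny in-memory demo file with a made-up layout, since the real file and its exact paths are not reproduced here.

```python
import h5py
import numpy as np

def print_item(name, node):
    # Print each object's path; for datasets, also show the dtype.
    if isinstance(node, h5py.Dataset):
        print('Dataset: /' + name, 'dtype =', node.dtype)
    else:
        print('Group:   /' + name)

# Tiny in-memory file mimicking the /Genotypes/.../calls layout.
with h5py.File('inspect_demo.h5', 'w', driver='core', backing_store=False) as h5f:
    grp = h5f.create_group('Genotypes/genotype a/genotype a_1')
    grp.create_dataset('calls', data=np.zeros((2, 2), dtype=int))
    h5f.visititems(print_item)
```

Running the same visitor on the real file (opened with mode 'r') lists every path, which also reveals any unexpected characters in the group names.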
This is the code for creating the sample h5 file:
import h5py
import numpy as np

ngrps = 2
nsgrps = 3
nds = 4
nrows = 10
ncols = 2
i_arr_dtype = ([('col1', int), ('col2', int)])

with h5py.File('d:/Path/sample_file.h5', 'w') as h5w:
    for gcnt in range(ngrps):
        grp1 = h5w.create_group('Group_' + str(gcnt))
        for scnt in range(nsgrps):
            grp2 = grp1.create_group('SubGroup_' + str(scnt))
            for dcnt in range(nds):
                i_arr = np.random.randint(1, 100, (nrows, ncols))
                ds = grp2.create_dataset('calls_' + str(dcnt), data=i_arr)
First, I used h5py and numpy, as below:
import h5py
import numpy as np

def dump_calls2csv(name, node):
    if isinstance(node, h5py.Dataset) and 'calls' in node.name:
        print('visiting object:', node.name, ', exporting data to CSV')
        csvfname = node.name[1:].replace('/', '_') + '.csv'
        arr = node[:]
        np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')

##########################
with h5py.File('d:/Path/sample_file.h5', 'r') as h5r:
    h5r.visititems(dump_calls2csv)  # NOTE: function name is NOT a string!
I also used PyTables, as below:
import tables as tb
import numpy as np

with tb.File('sample_file.h5', 'r') as h5r:
    for node in h5r.walk_nodes('/', classname='Leaf'):
        print('visiting object:', node._v_pathname, 'export data to CSV')
        csvfname = node._v_pathname[1:].replace('/', '_') + '.csv'
        np.savetxt(csvfname, node.read(), fmt='%5d', delimiter=',')
but each method raises an error. The error for the first code is:
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\python.exe C:\Users\...\PycharmProjects\DLLearn\datapreparation.py
visiting object: /Genotypes/Genotype a/genotye a_1/calls , exporting data to CSV
.
.
.    (similar lines printed for some of the datasets)
.
.
.
Traceback (most recent call last):
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 31, in <module>
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\h5py\_hl\group.py", line 565, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5o.pyx", line 355, in h5py.h5o.visit
  File "h5py\defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
  File "h5py\h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\h5py\_hl\group.py", line 564, in proxy
    return func(name, self[name])
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 10, in dump_calls2csv
    np.savetxt(csv_name, arr, fmt='%5d', delimiter=',')
  File "<__array_function__ internals>", line 6, in savetxt
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1377, in savetxt
    open(fname, 'wt').close()
OSError: [Errno 22] Invalid argument: 'Genotypes_Genotype_Name-Genotype_Name2_calls.csv'

Process finished with exit code 1
and the error for the second code is:
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\python.exe C:\Users\...\PycharmProjects\DLLearn\datapreparation.py
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\tables\attributeset.py:308: DataTypeWarning: Unsupported type for attribute 'locked' in node 'Genotypes'. Offending HDF5 class: 8
value = self._g_getattr(self._v_node, name)
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\tables\attributeset.py:308: DataTypeWarning: Unsupported type for attribute 'retainRareAlleles' in node 'Genotypes'. Offending HDF5 class: 8
value = self._g_getattr(self._v_node, name)
visiting object: /Genotypes/AlleleStates export data to CSV
Traceback (most recent call last):
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1447, in savetxt
    v = format % tuple(row) + newline
TypeError: %d format: a number is required, not numpy.bytes_

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 40, in <module>
    np.savetxt(csvfname, node.read(), fmt= '%d', delimiter=',')
  File "<__array_function__ internals>", line 6, in savetxt
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1451, in savetxt
    % (str(X.dtype), format))
TypeError: Mismatch between array dtype ('|S1') and format specifier ('%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d')

Process finished with exit code 1
Can anybody help me with this problem? Please mention the exact changes I need to apply to the code and provide the complete code. Since my background is not coding, further explanations would be greatly appreciated.
I downloaded the example file from your comments. This is a new answer based on my findings.

If all calls datasets have integer data, the fmt='%d' format should work. The only problem I found is invalid characters in the filenames created from the group/dataset paths. For example, : and ? are used in some group names. I modified dump_calls2csv() to replace : with - and to replace ? with #.

Run this and you should get all calls datasets written as CSV files. See the new code below:
def dump_calls2csv(name, node):
    if isinstance(node, h5py.Dataset) and 'calls' in node.name:
        csvfname = node.name[1:] + '.csv'
        csvfname = csvfname.replace('/', '_')  # create csv file name from path
        csvfname = csvfname.replace(':', '-')  # replace invalid character
        csvfname = csvfname.replace('?', '#')  # replace invalid character
        print('export data to CSV:', csvfname)
        np.savetxt(csvfname, node[:], fmt='%d', delimiter=',')
I print csvfname to confirm the character replacements are working as expected. It also helps identify the problem dataset if a filename still triggers an error.

Hope that helps. Be patient when you run this: in my test, about half of the CSV files had been written after 45 minutes.
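If any exported dataset turns out not to hold integers (the |S1 error from the PyTables run shows at least some nodes contain byte strings), a dtype-aware variant of the visitor is one possible safeguard. This is only a sketch, not part of my tested answer; dump_calls2csv_any is a hypothetical name, and it assumes the data is numeric or fixed-length bytes:

```python
import h5py
import numpy as np

def dump_calls2csv_any(name, node):
    # Export every 'calls' dataset, choosing a savetxt format from its dtype.
    if isinstance(node, h5py.Dataset) and 'calls' in node.name:
        csvfname = node.name[1:].replace('/', '_')
        csvfname = csvfname.replace(':', '-').replace('?', '#') + '.csv'
        arr = node[:]
        if arr.dtype.kind == 'S':
            arr = arr.astype('U')  # decode fixed-length bytes for clean text output
        if np.issubdtype(arr.dtype, np.integer):
            fmt = '%d'
        elif np.issubdtype(arr.dtype, np.floating):
            fmt = '%g'
        else:
            fmt = '%s'             # strings fall back to plain text
        np.savetxt(csvfname, arr, fmt=fmt, delimiter=',')
```

You would pass this to h5r.visititems() exactly like the original function.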
At this point I think the only problem is invalid characters in the filenames, NOT anything related to HDF5, h5py, or np.savetxt(). For the general case (with any group/dataset names), there should be a check for invalid filename characters.