Differences between encoding of char and uint16 in .mat v7.3 files


I am trying to read MATLAB v7.3 .mat files in Python using h5py.

I am encountering a problem where the representations of character arrays (typically, .mat fields containing a single string) and uint16 arrays appear identical.

>> ushortarr = uint16([69 109 105 102])
>> strarr = 'gibl'
>> save('short_string_difference.mat', 'ushortarr', 'strarr', '-v7.3')

When loaded back into matlab, matlab is able to detect the correct data types of these variables:

>> ss73 = load('short_string_difference.mat')
ss73 =
       strarr: 'gibl'
       ushortarr: [69 109 105 102]

But h5py suggests that the structure of this file is as follows:

(Pdb) strarr
<HDF5 dataset "strarr": shape (4, 1), type "<u2">
(Pdb) ushortarr
<HDF5 dataset "ushortarr": shape (4, 1), type "<u2">
(Pdb) strarr.value
array([[103],
       [105],
       [ 98],
       [108]], dtype=uint16)
(Pdb) ushortarr.value
array([[ 69],
       [109],
       [105],
       [102]], dtype=uint16)

(I also checked and determined that octave has a similar behavior to h5py for v7.3 matlab files, but that both scipy.io.loadmat and octave have correct behavior for older, <=v7 .mat files. Looking through their bug reports suggests that they don't have a fix for this or for a bunch of other problems with v7.3 mat files, and that they don't officially support v7.3 at all.)

My question is this: what data that h5py ignores, or other trick, is matlab using to determine the types of these variables when it loads them from this file? A secondary question: is there a python implementation of a reader that performs whatever check is used to make this determination?


Solution

  • Take a look at the dataset's HDF5 attributes, which h5py exposes via:

    strarr.attrs
    

    There you will find an attribute named MATLAB_class, whose value is char for strarr and uint16 for ushortarr.
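
    As a rough sketch of how this check could be used when reading the file from the question: the helper names matlab_class, read_variable and print_types below are made up for illustration and are not part of h5py. It assumes the attribute comes back as bytes (handled either way), that the char data is a 1xN row vector, and that the text contains no characters outside the Basic Multilingual Plane.

```python
def matlab_class(dataset):
    """Return the dataset's MATLAB_class attribute as a str, or None if absent."""
    cls = dataset.attrs.get("MATLAB_class")
    # h5py typically returns this attribute as a bytes object; normalise to str.
    return cls.decode("ascii") if isinstance(cls, bytes) else cls


def read_variable(dataset):
    """Decode a char row vector to a Python str; return other data unchanged."""
    data = dataset[()]  # read the whole dataset into memory
    if matlab_class(dataset) == "char":
        # Assumption: a 1xN MATLAB char array shows up as an (N, 1) array of
        # 16-bit character codes, so walking it row by row recovers the text.
        return "".join(chr(int(code)) for row in data for code in row)
    return data


def print_types(path):
    """Print name, MATLAB class and decoded value for each top-level dataset."""
    import h5py  # imported lazily so the helpers above need nothing but the stdlib

    with h5py.File(path, "r") as f:
        for name, item in f.items():
            if isinstance(item, h5py.Dataset):
                print(name, matlab_class(item), read_variable(item))
```

    Calling print_types('short_string_difference.mat') on the file saved in the question should then let you tell strarr and ushortarr apart, even though both are stored as <u2.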