
Reading binary file using numpy fromfile in combination with lzma open


I have a binary file that contains double values (64-bit floating-point data). Using numpy.fromfile

>>> data1 = numpy.fromfile(open('myfile', 'rb'))

I get the correct data (and the same result with data1 = numpy.fromfile('myfile')):

>>> data1
array([  1.29000000e-07,   3.70000000e-08,   3.80000000e-08,
     3.70000000e-08,   3.60000000e-08,   3.80000000e-08,
     3.80000000e-08,   3.70000000e-08,   3.80000000e-08,
     3.60000000e-08,   3.80000000e-08,   3.70000000e-08,
     3.60000000e-08,   3.60000000e-08,   3.80000000e-08,
     3.50000000e-08,   3.80000000e-08,   3.80000000e-08,
     3.80000000e-08,   3.60000000e-08,   3.70000000e-08,
     3.60000000e-08,   3.70000000e-08,   3.70000000e-08,
     3.60000000e-08,   3.50000000e-08,   3.70000000e-08,
     3.70000000e-08,   3.60000000e-08,   3.50000000e-08,
     3.80000000e-08,   3.80000000e-08,   3.60000000e-08,
     3.50000000e-08,   3.90000000e-08,   3.70000000e-08,
     3.70000000e-08,   3.70000000e-08,   3.50000000e-08,
     3.70000000e-08,   3.60000000e-08,   3.70000000e-08,
     3.80000000e-08,   3.90000000e-08,   3.90000000e-08,
     3.60000000e-08,   3.60000000e-08,   3.70000000e-08,
     3.60000000e-08,   3.80000000e-08,   3.70000000e-08,
     3.50000000e-08,   3.50000000e-08,   3.60000000e-08,
     3.60000000e-08,   3.70000000e-08,   3.50000000e-08,
     3.70000000e-08,   3.60000000e-08,   3.80000000e-08,
     3.80000000e-08,   3.80000000e-08,   3.80000000e-08,
     3.90000000e-08,   3.90000000e-08,   3.50000000e-08,
     3.80000000e-08,   3.80000000e-08,   3.70000000e-08,
     3.70000000e-08,   3.60000000e-08,   3.80000000e-08,
     3.60000000e-08,   3.70000000e-08,   3.70000000e-08,
     3.80000000e-08,   3.60000000e-08,   3.60000000e-08,
     3.50000000e-08,   3.80000000e-08,   3.60000000e-08,
     3.70000000e-08,   3.60000000e-08,   3.80000000e-08,
     3.50000000e-08,   3.80000000e-08,   3.70000000e-08,
     3.60000000e-08,   3.70000000e-08,   3.90000000e-08,
     3.60000000e-08,   3.60000000e-08,   3.90000000e-08,
     3.80000000e-08,   3.60000000e-08,   3.60000000e-08,
     3.70000000e-08,   3.70000000e-08])

I now compress this file using xz

xz -k myfile

and then try to read the data in Python using the lzma module

>>> data2 = numpy.fromfile(lzma.open('myfile.xz'))
>>> data2
array([  2.05244522e-289,   3.09873319e-303,  -9.10852154e-136,
     9.99900586e-150,  -7.22647881e+061,  -3.03508634e-168,
     1.40409926e+097,  -8.66961452e+219,   2.28992199e-308,
    -7.28706929e+173,   1.41101250e+029,  -2.94590886e-279,
     7.21680144e+171,  -4.62715868e+045,   3.05536517e-138,
    -2.94268247e-043,  -1.54563603e-295,   7.53024241e+102,
    -1.22865109e+263,   2.62485731e+044,   4.52556260e-312,
     1.18164036e-240,   3.56496646e-311,  -2.82751232e+286,
     1.69336097e+127])

Why is this happening? Looking at the content of the file object via read gives

>>> open('myfile', 'rb').read()
b'B$\xf7\xffgP\x81>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\x85U\xef\x82\x1e\xf0d>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>\xb3z\xea\x05]\xcab>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>\x85U\xef\x82\x1e\xf0d>\x85U\xef\x82\x1e\xf0d>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>\xb3z\xea\x05]\xcab>\xb3z\xea\x05]\xcab>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\xb3z\xea\x05]\xcab>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\x85U\xef\x82\x1e\xf0d>\x85U\xef\x82\x1e\xf0d>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x85U\xef\x82\x1e\xf0d>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x85U\xef\x82\x1e\xf0d>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>'
>>> lzma.open('myfile.xz').read()
b'B$\xf7\xffgP\x81>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\x85U\xef\x82\x1e\xf0d>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>\xb3z\xea\x05]\xcab>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>\x85U\xef\x82\x1e\xf0d>\x85U\xef\x82\x1e\xf0d>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>\xb3z\xea\x05]\xcab>\xb3z\xea\x05]\xcab>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\xb3z\xea\x05]\xcab>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\x85U\xef\x82\x1e\xf0d>\x85U\xef\x82\x1e\xf0d>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\xd1\x1e\xae#\xaefd>\xb3z\xea\x05]\xcab>\xd1\x1e\xae#\xaefd>\x1c\xe8l\xc4=\xddc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x85U\xef\x82\x1e\xf0d>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x85U\xef\x82\x1e\xf0d>\xd1\x1e\xae#\xaefd>g\xb1+e\xcdSc>g\xb1+e\xcdSc>\x1c\xe8l\xc4=\xddc>\x1c\xe8l\xc4=\xddc>'

which looks good to me. The types seem correct as well:

>>> type(data1)
<class 'numpy.ndarray'>
>>> type(data1[0])
<class 'numpy.float64'>

>>> type(data2)
<class 'numpy.ndarray'>
>>> type(data2[0])
<class 'numpy.float64'>

I expect the content of arrays data1 and data2 to be equal.


Solution

  • I don't know exactly why it works, but I have a solution. I generated a test file with the tofile method.

    I then read the compressed version with frombuffer instead of fromfile:

    import lzma
    import numpy as np
    data_xz = np.frombuffer(lzma.open('data.bin.xz', mode='rb').read())
    data_bin = np.fromfile('data.bin')

    and the data is equal upon reading.

    My guess is that np.fromfile does not go through the object's Python-level read method but works on the underlying file directly, in which case it would see the compressed bytes rather than what the lzma module's read returns.

    In any case, data is best stored in a consistent format. For small data sets, plain text is fine. Otherwise, there is joblib's persistence module or HDF5 for Python.
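
The workaround above can be checked end to end with a short round trip. This is only a minimal sketch: the helper name read_xz_doubles, the temporary file names, and the sample values are made up for illustration, and compressing with lzma.open stands in for running xz -k on the command line. It assumes the file holds raw 64-bit floats with no header, as in the question.

```python
import lzma
import os
import tempfile

import numpy as np


def read_xz_doubles(path, dtype=np.float64):
    """Decompress an .xz file fully and interpret the bytes as `dtype`.

    Hypothetical helper, not part of numpy or lzma.
    """
    with lzma.open(path, mode="rb") as fh:
        raw = fh.read()  # decompressed bytes, obtained via the Python-level read()
    # frombuffer gives a read-only view on `raw`; copy() makes a writable array
    return np.frombuffer(raw, dtype=dtype).copy()


# Made-up sample values (not the data from the question)
sample = np.array([1.29e-07, 3.7e-08, 3.8e-08, 3.6e-08], dtype=np.float64)

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "data.bin")
    xz_path = plain_path + ".xz"

    sample.tofile(plain_path)  # raw 64-bit floats, no header
    with open(plain_path, "rb") as src, lzma.open(xz_path, "wb") as dst:
        dst.write(src.read())  # stands in for `xz -k data.bin`

    data_bin = np.fromfile(plain_path, dtype=np.float64)
    data_xz = read_xz_doubles(xz_path)

print(np.array_equal(data_bin, data_xz))
```

Passing dtype explicitly to both fromfile and frombuffer avoids relying on the default, and reading the whole decompressed stream into memory is fine for small files like the one in the question; very large files would need a chunked approach.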