Reading large binary files (>2GB) with python


I am writing a program to process some binary files. I used numpy.fromfile and everything worked fine until I came across some big binary files (>2 GB), which numpy can't read because of memory problems. I then tried h5py, without success, because I couldn't work out how to convert my files to h5 files. Now I am trying to use open(), read() and struct.unpack_from to reconstruct the data as I would have done in C++.

My binary files contain 32-bit floats that are to be paired into 64-bit complex numbers.

The problem at the moment is that, from the information I gathered, struct.unpack_from() should return a tuple with all the data of the specified type in the file, but it only returns the first element of the file:

The code:

import struct

f1 = open(IQ_File, 'rb')
a1 = f1.read()
f = struct.unpack_from('f', a1)
print(f)

What I am expecting here is an output with the binary back to floats, however my output is only:

(-0.057812511920928955,)

-- a tuple containing only the first float of the file.

I really don't understand what I am doing wrong here. What should I be doing differently?


Solution

  • Pack/unpack format strings can have each item prefixed with a number to have that many items packed/unpacked. A bare 'f' describes exactly one float, which is why only the first value came back. Just divide the data size by the size of a float and put that number in the format:

    nf = len(a1) // struct.calcsize('f')
    f = struct.unpack(f"{nf}f", a1)
    

    Mind that tuples are a very inefficient way to store numeric array data in Python. On 64-bit systems (e.g., macOS) with CPython, a tuple of N floats uses 24+N*8 bytes (sizeof(PyObject_VAR_HEAD) + N pointers) for the tuple itself plus N*24 bytes (sizeof(PyObject_HEAD) + one double) for the float objects (Python floats are stored internally as doubles), or 24+N*32 bytes in total. That's about 8 times the size of the binary data!

    A better option is to use numpy.fromfile() and explicitly provide the count and, if needed, offset arguments in order to read the file in chunks. If you need to know in advance how many floats there are in total in the file, use os.stat():

    nf = os.stat(IQ_File).st_size // struct.calcsize('f')
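
Putting the pieces together, a chunked read might look roughly like the sketch below. It is only an illustration under a few assumptions that the question does not confirm: the floats are stored in native byte order as interleaved (real, imaginary) pairs, and your NumPy is recent enough (1.17 or later) for numpy.fromfile() to accept the offset argument. The function name iter_complex_chunks and the floats_per_chunk parameter are made up for this example.

```python
import os

import numpy as np


def iter_complex_chunks(path, floats_per_chunk=1 << 20):
    """Yield complex64 arrays from a file of float32 values, one chunk at a time.

    Assumes (not confirmed by the question) that the file holds interleaved
    (real, imag) float32 pairs in native byte order.
    """
    if floats_per_chunk % 2:
        raise ValueError("floats_per_chunk must be even so pairs stay together")

    itemsize = np.dtype(np.float32).itemsize  # 4 bytes per float
    total_floats = os.stat(path).st_size // itemsize
    total_floats -= total_floats % 2  # ignore a trailing unpaired float, if any

    done = 0
    while done < total_floats:
        count = min(floats_per_chunk, total_floats - done)
        # offset is given in bytes, count in items of dtype
        chunk = np.fromfile(path, dtype=np.float32, count=count,
                            offset=done * itemsize)
        # Reinterpret each pair of float32 values as one complex64, no copy
        yield chunk.view(np.complex64)
        done += count
```

You would then loop with something like `for block in iter_complex_chunks(IQ_File): ...` and process each block before the next one is read, so only one chunk is held in memory at a time.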