I have 10GB of data in binary files in little-endian format, and I am converting them to integers like this:

import struct

with open(myfile, 'rb') as inh:
    data = inh.read()

for i in range(0, len(data), 4):
    # '<i' makes the little-endian byte order explicit; unpack returns a tuple
    pos = struct.unpack('<i', data[i:i+4])[0]

But it takes a really long time to convert each 100MB file. Is there any way to speed up the process?
If you don't mind using numpy, you can use numpy.memmap:
import numpy as np
data = np.memmap('foo.bin', dtype='<i4', mode='r')
For example,
In [122]: !hexdump foo.bin
0000000 01 00 00 00 02 00 00 00 03 00 00 00 ff 00 00 00
0000010 00 01 00 00 01 01 00 00 ff ff ff ff fe ff ff ff
0000020
In [123]: data = np.memmap('foo.bin', dtype='<i4', mode='r')
In [124]: data
Out[124]: memmap([ 1, 2, 3, 255, 256, 257, -1, -2], dtype=int32)
In [125]: data[6]
Out[125]: -1
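Because the memmap is lazy, slicing it only touches the bytes that slice covers, so you can process files larger than available RAM chunk by chunk. Here is a minimal sketch of how that might look for the original use case; the file name and chunk size are illustrative assumptions:

import numpy as np

# Map the whole file without reading it into memory; the OS pages in
# only the parts that are actually accessed.
data = np.memmap('myfile.bin', dtype='<i4', mode='r')  # 'myfile.bin' is a placeholder

chunk = 1_000_000  # illustrative chunk size, in integers
total = 0
for start in range(0, len(data), chunk):
    # Each slice is read from disk lazily as it is accessed.
    total += int(data[start:start + chunk].sum())
print(total)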
It might not be necessary to memory-map the data, in which case you can simply read it into an array with numpy.fromfile. For example,
In [129]: data = np.fromfile('foo.bin', dtype='<i4')
In [130]: data
Out[130]: array([ 1, 2, 3, 255, 256, 257, -1, -2], dtype=int32)
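And if the bytes are already in memory (as they are after the inh.read() in the question), numpy.frombuffer can reinterpret them as an integer array without any Python-level loop. A small sketch, with the file name again a placeholder:

import numpy as np

with open('myfile.bin', 'rb') as inh:  # placeholder file name
    raw = inh.read()  # the whole file as bytes, as in the question

# Reinterpret the buffer as little-endian 32-bit integers (no copy is made;
# the resulting array is read-only because it is backed by the bytes object).
data = np.frombuffer(raw, dtype='<i4')
print(data[:8])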