Tags: python, performance, numpy, binary-files

improve speed when reading a binary file


I have a large binary file that I want to read into an array. The format of the binary file is:

  • there are 4 extra bytes at the start and end of each row that I'm not using;
  • in between are 8-byte values

I'm doing it like this:

        # nlines - number of rows in the binary file
        # ncols - number of values to read from each row

        fidbin = open('toto.mda', 'rb')  # open the file
        fidbin.read(4)  # skip the 4 bytes at the start of the first row
        nvalues = nlines * ncols  # total number of values

        array = np.zeros(nvalues, dtype=np.float64)

        # read ncols values per row and skip the useless data in between
        for c in range(int(nlines)):  # read the nlines of the *.mda file
            matrix = np.fromfile(fidbin, np.float64, count=int(ncols))  # read all the values of one row
            Indice_start = c * ncols
            array[Indice_start:Indice_start + ncols] = matrix
            fidbin.seek(fidbin.tell() + 8)  # skip 8 bytes (4 at the end of this row + 4 at the start of the next)
        fidbin.close()

It works well, but the problem is that it is very slow for large binary files. Is there a way to increase the reading speed?
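For reference, a small file in the layout described above (a 4-byte marker, then ncols 8-byte floats, then another 4-byte marker per row) can be generated like this; the file name, marker values, and data are made up for illustration:

```python
import numpy as np

nlines, ncols = 3, 5
with open("toto.mda", "wb") as f:
    for r in range(nlines):
        f.write(np.int32(r + 1).tobytes())  # 4-byte leading marker
        # ncols 8-byte float64 values for this row
        f.write(np.arange(r * ncols, (r + 1) * ncols, dtype=np.float64).tobytes())
        f.write(np.int32(0).tobytes())      # 4-byte trailing marker
```

Each row is then 4 + 8*ncols + 4 bytes long, which is the stride the reading code has to respect.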


Solution

  • You can use a structured data type and read the file with a single call to numpy.fromfile. For example, my file qaz.mda has five columns of floating-point values between the four-byte markers at the start and end of each row. Here's how you can create a structured data type and read the data.

    First, create a data type that matches the format of each row:

    In [547]: ncols = 5
    
    In [548]: dt = np.dtype([('pre', np.int32), ('data', np.float64, ncols), ('post', np.int32)])
    

    Read the file into a structured array:

    In [549]: a = np.fromfile("qaz.mda", dtype=dt)
    
    In [550]: a
    Out[550]: 
    array([(1, [0.0, 1.0, 2.0, 3.0, 4.0], 0),
           (2, [5.0, 6.0, 7.0, 8.0, 9.0], 0),
           (3, [10.0, 11.0, 12.0, 13.0, 14.0], 0),
           (4, [15.0, 16.0, 17.0, 18.0, 19.0], 0),
           (5, [20.0, 21.0, 22.0, 23.0, 24.0], 0)], 
          dtype=[('pre', '<i4'), ('data', '<f8', (5,)), ('post', '<i4')])
    

    Pull out just the data that we want:

    In [551]: data = a['data']
    
    In [552]: data
    Out[552]: 
    array([[  0.,   1.,   2.,   3.,   4.],
           [  5.,   6.,   7.,   8.,   9.],
           [ 10.,  11.,  12.,  13.,  14.],
           [ 15.,  16.,  17.,  18.,  19.],
           [ 20.,  21.,  22.,  23.,  24.]])
    

    You could also experiment with numpy.memmap to see if it improves performance:

    In [563]: a = np.memmap("qaz.mda", dtype=dt)
    
    In [564]: a
    Out[564]: 
    memmap([(1, [0.0, 1.0, 2.0, 3.0, 4.0], 0),
           (2, [5.0, 6.0, 7.0, 8.0, 9.0], 0),
           (3, [10.0, 11.0, 12.0, 13.0, 14.0], 0),
           (4, [15.0, 16.0, 17.0, 18.0, 19.0], 0),
           (5, [20.0, 21.0, 22.0, 23.0, 24.0], 0)], 
          dtype=[('pre', '<i4'), ('data', '<f8', (5,)), ('post', '<i4')])
    
    In [565]: data = a['data']
    
    In [566]: data
    Out[566]: 
    memmap([[  0.,   1.,   2.,   3.,   4.],
           [  5.,   6.,   7.,   8.,   9.],
           [ 10.,  11.,  12.,  13.,  14.],
           [ 15.,  16.,  17.,  18.,  19.],
           [ 20.,  21.,  22.,  23.,  24.]])
    

    Note that data above is still a memory-mapped array. To ensure that the data is copied to an array in memory, numpy.copy can be used:

    In [567]: data = np.copy(a['data'])
    
    In [568]: data
    Out[568]: 
    array([[  0.,   1.,   2.,   3.,   4.],
           [  5.,   6.,   7.,   8.,   9.],
           [ 10.,  11.,  12.,  13.,  14.],
           [ 15.,  16.,  17.,  18.,  19.],
           [ 20.,  21.,  22.,  23.,  24.]])
    

    Whether or not that is necessary depends on how you will use the array in the rest of your code.
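    Putting it all together, here is a self-contained sketch that writes a small sample file in the row format described above and checks that the single-call numpy.fromfile read and the numpy.memmap read agree (file name, sizes, and marker values are assumptions for the demo):

    ```python
    import numpy as np

    nlines, ncols = 4, 5

    # write a sample file: 4-byte marker, ncols float64 values, 4-byte marker per row
    with open("qaz.mda", "wb") as f:
        for r in range(nlines):
            f.write(np.int32(r + 1).tobytes())
            f.write(np.arange(r * ncols, (r + 1) * ncols, dtype=np.float64).tobytes())
            f.write(np.int32(0).tobytes())

    # structured dtype matching one row
    dt = np.dtype([('pre', np.int32), ('data', np.float64, ncols), ('post', np.int32)])

    fast = np.fromfile("qaz.mda", dtype=dt)['data']          # one read call
    mapped = np.copy(np.memmap("qaz.mda", dtype=dt)['data']) # memory-mapped, then copied

    assert np.array_equal(fast, mapped)
    ```

    Both approaches replace the per-row Python loop with a single vectorized read, which is where the speedup comes from.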