Tags: python, numpy, list-comprehension, binaryfiles, python-itertools

Is there a faster method for reading a .bin file into a numpy array?


I am importing data from a .bin file to a numpy array using this code:

import itertools
import numpy as np

# One record = 1 uint32 shot number followed by 9 float32 measurements
dt = np.dtype([('ShotNum', np.uint32), ('X', np.float32), ('Y', np.float32),
               ('Z', np.float32), ('inten', np.float32), ('refl', np.float32),
               ('dopp', np.float32), ('range', np.float32), ('theta', np.float32),
               ('phi', np.float32)])
data = np.fromfile('Data.bin', dtype=dt)

# Flatten the structured array, then rebuild it with the desired dimensions
flatdata = list(itertools.chain.from_iterable(data))
flatdata1 = np.asarray(flatdata, dtype=float)
workdata = flatdata1.reshape(flatdata1.size // 10, 10)

It works, but it is slow. Specifically, unpacking the record tuples inside the structured array in the line

flatdata=list(itertools.chain.from_iterable(data))

is very slow. Is there a way to avoid creating this nested structure in the first place when importing the data? And if not, is there a faster way to flatten it?


Solution

  • Illustrating the use of chain in flattening a structured array:

    In [107]: data
    Out[107]: 
    array([( 1.,  2.,  3., 1), ( 1.,  2.,  3., 1), ( 1.,  2.,  3., 1)],
          dtype=[('a', '<f4'), ('b', '<f4'), ('c', '<f4'), ('d', '<i4')])
    In [108]: import itertools
    In [109]: list(itertools.chain.from_iterable(data))
    Out[109]: [1.0, 2.0, 3.0, 1, 1.0, 2.0, 3.0, 1, 1.0, 2.0, 3.0, 1]
    

    chain is a well-established method for flattening a nested list.

    Turning a structured array into a 2d array is a bit tricky. view and astype work, sometimes (a sketch of the view case follows the next example), but the most reliable is another list approach:

    In [110]: data.tolist()
    Out[110]: [(1.0, 2.0, 3.0, 1), (1.0, 2.0, 3.0, 1), (1.0, 2.0, 3.0, 1)]
    In [111]: np.array(data.tolist())
    Out[111]: 
    array([[ 1.,  2.,  3.,  1.],
           [ 1.,  2.,  3.,  1.],
           [ 1.,  2.,  3.,  1.]])
    
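    For completeness, a minimal sketch of the view route mentioned above. It is only safe when every field shares one dtype, because view reinterprets the raw bytes in place (the mixed float/int dtype of the sample data would break it):

    import numpy as np

    # All-float32 fields, so the record bytes can be reinterpreted safely
    homog = np.zeros(3, dtype=[('a', '<f4'), ('b', '<f4'), ('c', '<f4')])
    arr2d = homog.view(np.float32).reshape(len(homog), -1)  # shape (3, 3), no copy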

    Making the equivalent array:

    In [115]: np.fromiter(itertools.chain.from_iterable(data),float).reshape(3,-1)
    Out[115]: 
    array([[ 1.,  2.,  3.,  1.],
           [ 1.,  2.,  3.,  1.],
           [ 1.,  2.,  3.,  1.]])
    

    tolist is faster:

    In [116]: timeit np.fromiter(itertools.chain.from_iterable(data),float).reshape
         ...: (3,-1)
    22 µs ± 329 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    In [117]: timeit np.array(data.tolist())
    5.8 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
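    A side note on fromiter: its count argument lets it preallocate the output instead of growing it, which can help on large inputs. A sketch, recreating the small example array from the session above:

    import itertools
    import numpy as np

    data = np.array([(1., 2., 3., 1)] * 3,
                    dtype=[('a', '<f4'), ('b', '<f4'), ('c', '<f4'), ('d', '<i4')])
    nfields = len(data.dtype.names)
    # count = records * fields sizes the output buffer up front
    flat = np.fromiter(itertools.chain.from_iterable(data), float,
                       count=data.size * nfields)
    arr = flat.reshape(data.size, nfields)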
    

    As suggested in a comment, we can enumerate the fields and make an array from that:

    In [120]: [data[name] for name in data.dtype.names]
    Out[120]: 
    [array([ 1.,  1.,  1.], dtype=float32),
     array([ 2.,  2.,  2.], dtype=float32),
     array([ 3.,  3.,  3.], dtype=float32),
     array([1, 1, 1])]
    
    In [124]: np.array([data[name] for name in data.dtype.names]).T
    Out[124]: 
    array([[ 1.,  2.,  3.,  1.],
           [ 1.,  2.,  3.,  1.],
           [ 1.,  2.,  3.,  1.]])
    

    Similar time to the tolist approach:

    In [125]: timeit np.array([data[name] for name in data.dtype.names]).T
    6.94 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
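
    Pulling this back to the question: a sketch (reusing the question's dt and 'Data.bin') that builds workdata straight from the structured array, with no Python-level tuple unpacking:

    import numpy as np

    dt = np.dtype([('ShotNum', np.uint32), ('X', np.float32), ('Y', np.float32),
                   ('Z', np.float32), ('inten', np.float32), ('refl', np.float32),
                   ('dopp', np.float32), ('range', np.float32), ('theta', np.float32),
                   ('phi', np.float32)])
    data = np.fromfile('Data.bin', dtype=dt)

    # One 1d array per field, stacked as columns -> shape (n, 10), dtype float
    workdata = np.array([data[name] for name in data.dtype.names], dtype=float).T

    # On NumPy 1.16+, recfunctions can do the same unpacking in one call:
    # from numpy.lib import recfunctions as rfn
    # workdata = rfn.structured_to_unstructured(data).astype(float)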