
Cython fast conversion of binary string to int array


I have a large binary data file which I want to load into a C array for fast access. The data file just contains a sequence of 4-byte ints.

I get the data via the pkgutil.get_data function, which returns a binary string. The following code works:

import pkgutil
import struct

cdef int data[32487834]

def load_data():
    global data
    py_data = pkgutil.get_data('my_module', 'my_data')
    for i in range(32487834):
        data[i] = <int>struct.unpack('i', py_data[4*i:4*(i+1)])[0]
    return 0

load_data()

The problem is that this code is quite slow. Reading the whole data file can take 7 or 8 seconds. Reading the file directly into an array in C only takes 1-2 seconds, but I want to use pkgutil.get_data so that my module can reliably find the data wherever it gets installed.

So, my question is: what's the best way to do this? Is there a way to directly cast the data as an array of ints without all the calls to struct.unpack? And, as a secondary question, is there a way to simply get a pointer to the data to avoid copying 120MB of data unnecessarily?

Alternatively, is there a way to make pkgutil return the file path to the data instead of the data itself (in which case I can use C file IO to read the file quite quickly)?

EDIT:

Just for the record, here's the final code used (based on Veedrac's answer):

import pkgutil

from cpython cimport array
import array

cdef int[:] data

cdef void load_data():
    global data
    py_data = pkgutil.get_data('my_module', 'my_data')
    data = array.array('i', py_data)

load_data()

Everything is quite fast.
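
For anyone who wants to check the conversion without compiling anything, here is a minimal plain-Python sketch comparing the per-element struct.unpack loop with the bulk array.array conversion. The helper names unpack_loop and unpack_bulk and the sample bytes are made up for illustration; the real data would come from pkgutil.get_data. It assumes the file was written with the platform's native int size and byte order.

```python
import array
import struct

INT_SIZE = struct.calcsize('i')  # size of a native C int, usually 4


def unpack_loop(raw):
    """Slow reference: one struct.unpack call per int, like the original loop."""
    count = len(raw) // INT_SIZE
    return [struct.unpack('i', raw[INT_SIZE * i:INT_SIZE * (i + 1)])[0]
            for i in range(count)]


def unpack_bulk(raw):
    """Bulk conversion: copy all bytes into an array of native ints in one call."""
    result = array.array('i')
    if len(raw) % result.itemsize:
        raise ValueError("data length is not a multiple of the int size")
    result.frombytes(raw)
    return result


# Hypothetical stand-in for the bytes returned by pkgutil.get_data.
sample = struct.pack('5i', 1, -2, 3, 40000, -500000)
assert unpack_bulk(sample).tolist() == unpack_loop(sample)
```

The bulk version does a single copy of the buffer rather than one Python-level call per element, which is where the speedup comes from.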


Solution

  • Chances are you should really just use NumPy:

    import numpy
    import random
    import struct
    
    data = struct.pack('i'*100, *[random.randint(0, 1000000) for _ in range(100)])
    
    # frombuffer reinterprets the bytes in place; fromstring is deprecated for binary data
    numpy.frombuffer(data, dtype="int32")
    #>>> array([642029, 967046, 599565, ...etc], dtype=int32)
    

    Then just use any of the standard methods to get a pointer from that.

    If you want to avoid NumPy, a faster but less portable method would be to go via a char pointer:

    cdef int *data_view = <int *><char *>data
    

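    This relies on behavior that is not well defined: it assumes the buffer is suitably aligned for int and that the file's int size and byte order match the platform's, and the pointer is only valid while the bytes object is alive. So be careful. Also be careful not to modify the data, since Python bytes objects are immutable!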

    A good compromise between the two would be to use cpython.array:

    from cpython cimport array
    import array
    
    def main(data):
        cdef array.array[int] data_arr = array.array('i', data)
        cdef int *data_ptr = data_arr.data.as_ints
    

    which gives you well-defined semantics and is fast, using only built-in libraries.
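
On the secondary question about avoiding the copy: one option worth sketching, which is not part of the answer above, is a memoryview cast. It reinterprets the bytes as native ints without copying them. The helper name int_view and the sample bytes are made up for illustration, and the same native size and byte-order assumptions apply.

```python
import struct


def int_view(raw):
    """Return a zero-copy view of raw, reinterpreted as native C ints.

    The cast raises TypeError if len(raw) is not a multiple of the int size.
    """
    return memoryview(raw).cast('i')


# Hypothetical stand-in for the bytes returned by pkgutil.get_data.
sample = struct.pack('4i', 10, 20, 30, 40)
view = int_view(sample)

print(view.tolist())       # [10, 20, 30, 40]
print(view.readonly)       # True: the underlying bytes object is immutable
print(view.obj is sample)  # True: the view refers to the original buffer
```

Because the view is read-only, a Cython typed memoryview over it would, as far as I know, need to be declared const int[:] rather than int[:]; treat that as an assumption to verify against your Cython version.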