I have a large binary data file which I want to load into a C array for fast access. The data file just contains a sequence of 4-byte ints.
I get the data via the pkgutil.get_data function, which returns a binary string. The following code works:
import pkgutil
import struct
cdef int data[32487834]
def load_data():
    global data
    py_data = pkgutil.get_data('my_module', 'my_data')
    for i in range(32487834):
        data[i] = <int>struct.unpack('i', py_data[4*i:4*(i+1)])[0]
    return 0
load_data()
The problem is that this code is quite slow. Reading the whole data file can take 7 or 8 seconds. Reading the file directly into an array in C only takes 1-2 seconds, but I want to use pkgutil.get_data so that my module can reliably find the data wherever it gets installed.
So, my question is: what's the best way to do this? Is there a way to directly cast the data as an array of ints without all the calls to struct.unpack? And, as a secondary question, is there a way to simply get a pointer to the data to avoid copying 120MB of data unnecessarily?
Alternatively, is there a way to make pkgutil return the file path to the data instead of the data itself (in which case I can use C file IO to read the file quite quickly)?
EDIT:
Just for the record, here's the final code used (based on Veedrac's answer):
import pkgutil
from cpython cimport array
import array
cdef int[:] data
cdef void load_data():
    global data
    py_data = pkgutil.get_data('my_module', 'my_data')
    data = array.array('i', py_data)
load_data()
Everything is quite fast.
Chances are you should really just use Numpy:
import numpy
import random
import struct
data = struct.pack('i'*100, *[random.randint(0, 1000000) for _ in range(100)])
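# numpy.fromstring is deprecated for binary input; frombuffer returns a read-only view without copying
numpy.frombuffer(data, dtype="int32")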
#>>> array([642029, 967046, 599565, ...etc], dtype=int32)
Then just use any of the standard methods to get a pointer from that.
If you want to avoid Numpy, a faster but less platform-agnostic method would be to go via a char pointer:
cdef int *data_view = <int *><char *>data
This has lots of "undefined"-ness to it, so be careful. Also be careful not to modify the data!
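As a point of comparison, the same "reinterpret the bytes as ints without copying" idea can be tried in plain Python with a memoryview. This is only a sketch: the sample values are made up for illustration and stand in for whatever pkgutil.get_data returns, and it assumes the bytes are in the machine's native int layout. Because the view sits on top of an immutable bytes object it is read-only, which enforces the "don't modify the data" warning for you.

```python
import struct

# Hypothetical sample data standing in for the bytes from pkgutil.get_data.
values = [10, 20, 30, 40]
raw = struct.pack('%di' % len(values), *values)

# Reinterpret the byte buffer as native C ints; no bytes are copied here.
view = memoryview(raw).cast('i')
as_list = view.tolist()

# The underlying bytes object is immutable, so writes through the view fail.
try:
    view[0] = 99
    write_allowed = True
except TypeError:
    write_allowed = False
```

The cast only succeeds when the buffer length is a multiple of the int size, so a truncated data file shows up as an error instead of a silent misread.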
A good compromise between Numpy and the raw pointer cast would be to use cpython.array:
from cpython cimport array
import array
def main(data):
    cdef array.array[int] data_arr = array.array('i', data)
    cdef int *data_ptr = data_arr.data.as_ints
which gives you well defined semantics and is fast with built-in libraries.
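The conversion step itself can be checked in plain Python before moving it into Cython. The sketch below uses made-up sample values in place of the real data file and compares the single array.array call against the per-element struct.unpack loop from the question; it assumes the file was written in the machine's native int format.

```python
import array
import struct

# Hypothetical sample data standing in for the bytes from pkgutil.get_data.
values = [3, 1, 4, 1, 5, 9, 2, 6]
raw = struct.pack('%di' % len(values), *values)

# One call converts the whole buffer (this copies the bytes into the array).
arr = array.array('i', raw)

# The per-element route from the question, kept only for comparison.
size = struct.calcsize('i')
slow = [struct.unpack('i', raw[size * i:size * (i + 1)])[0]
        for i in range(len(values))]
```

Both routes produce the same integers; the array version does the work in one C-level pass instead of one slice and one unpack call per element, and it does make one copy of the data.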