I'm writing a Python extension module in C++ with Boost.Python. I want to return numpy arrays from the module to Python. This works well with numeric datatypes like double, but at one point I need to create a string array from existing data.
For numeric arrays I used PyArray_SimpleNewFromData, which worked well. Since strings are not of fixed length, I used PyArray_New instead, where I can pass in the itemsize, which in my case is 4. Here's a minimal example:
bool initNumpy()
{
    Py_Initialize();
    import_array();
    return true;
}

class Foo {
public:
    Foo() {
        initNumpy();
        data.reserve(10);
        data = {"Rx", "Rx", "Rx", "RxTx", "Tx", "Tx", "Tx", "RxTx", "Rx", "Tx"};
    }

    PyObject* getArray() {
        npy_intp dims[] = { data.size() };
        return (PyObject*)PyArray_New(&PyArray_Type, 1, dims, NPY_STRING, NULL,
                                      &data[0], 4, NPY_ARRAY_OWNDATA, NULL);
    }

private:
    std::vector<std::string> data;
};
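For comparison, the numeric version that works looks roughly like this (simplified; assuming a std::vector<double> member called numericData):

PyObject* getNumericArray() {
    // works because std::vector<double> stores its elements contiguously,
    // exactly as NPY_DOUBLE expects
    npy_intp dims[] = { (npy_intp)numericData.size() };
    return PyArray_SimpleNewFromData(1, dims, NPY_DOUBLE, numericData.data());
}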
I expect the output of getArray() to be equal to the output of numpy.array(["Rx", "Rx", ...], dtype="S4"), which is:

array([b'Rx', b'Rx', b'Rx', b'RxTx', b'Tx', b'Tx', b'Tx', b'RxTx', b'Rx',
       b'Tx'], dtype='|S4')

but it looks like this:

array([b'Rx', b'', b'\xcc\xb3b\xd9', b'\xfe\x07', b'\x02', b'', b'\x0f',
       b'', b'Rx\x00\x03', b''], dtype='|S4')
I tried playing around with the npy_intp const* strides argument, because I think the issue is with the memory blocks of the underlying data, but unfortunately that didn't achieve the desired result.
I'm using Microsoft Build Tools 2017, Boost.Python, distutils and Python 3.7 to build the extension.
When using PyArray_New, the passed data must have the memory layout expected by the numpy array. This is the case for simple data types such as np.float64, but it is not the case for a std::vector<std::string> and dtype='|S4'.

First, what memory layout does PyArray_New expect for |S4?
Let's take as an example

array([b'Rx', b'RxTx', b'T'], dtype='|S4')

The expected memory layout would be:

| R| x|\0|\0| R| x| T| x| T|\0|\0|\0|
|- 1. word -|- 2. word -|- 3. word -|
There are the following noteworthy details:

- The data is one contiguous block of memory with exactly itemsize (here 4) bytes per element; there are no pointers to the individual strings.
- Strings shorter than itemsize are padded with \0, i.e. the NUL character. One is out of luck if one wants to store strings with trailing \0 - but this is another story.

A std::vector<std::string> has a completely different memory layout - and because the memory layout of std::string isn't prescribed by the C++ standard, it can change from implementation to implementation.
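One can make this difference visible with a few lines of stand-alone C++ (a sketch; the concrete numbers depend on the standard-library implementation):

#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> data = {"Rx", "RxTx", "T"};
    // &data[0] points to an array of std::string objects, each typically
    // 24-32 bytes of bookkeeping (pointer/size/capacity or an SSO buffer),
    // not to a packed block of 4-byte character words.
    std::cout << "sizeof(std::string):  " << sizeof(std::string) << "\n";
    // The characters themselves may even live in separate heap allocations:
    std::cout << "string object at:     " << static_cast<const void*>(&data[1]) << "\n"
              << "its characters at:    " << static_cast<const void*>(data[1].data()) << "\n";
}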
The result of the above observations is that there is no way around copying the data if the strings are given as a std::vector<std::string>. The function consists of three steps:

1. Allocate one contiguous block of memory of size vals.size() * itemsize.
2. Copy the strings into that block, padding every element with NUL bytes up to itemsize.
3. Create the numpy array from this memory and hand ownership of it over to the array.
Below is an example implementation for C++11, in which error handling is left as an exercise for the reader:
PyObject* create_np_array(const std::vector<std::string>& vals, size_t itemsize){
    //1. step: allocate memory
    size_t mem_size = vals.size()*itemsize;
    void* mem = PyDataMem_NEW(mem_size);
    //ToDo: check mem!=nullptr
    //ToDo: make code exception safe

    //2. step: initialize memory/copy data
    size_t cur_index = 0;
    for(const auto& val : vals){
        for(size_t i=0; i<itemsize; i++){
            char ch = i<val.size() ? val[i]
                                   : 0;  //fill with NUL if string is too short
            reinterpret_cast<char*>(mem)[cur_index] = ch;
            cur_index++;
        }
    }

    //3. step: create the numpy array, which takes ownership of mem
    npy_intp dim = static_cast<npy_intp>(vals.size());
    return (PyObject*)PyArray_New(&PyArray_Type, 1, &dim, NPY_STRING, NULL,
                                  mem, static_cast<int>(itemsize),
                                  NPY_ARRAY_OWNDATA, NULL);
}
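Used from the Foo class of the question, it could look roughly like this (assuming an itemsize of 4 is large enough for every entry):

PyObject* getArray() {
    return create_np_array(data, 4);  // copies data into a buffer owned by the array
}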
One last important thing: one should use PyDataMem_NEW to allocate the data instead of malloc if it is to be owned by the resulting numpy array (the NPY_ARRAY_OWNDATA flag). This has two advantages: numpy's memory tracing works correctly, and we don't (mis)use an implementation detail. For other ways to pass ownership of the data, see this SO post.
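As an illustration of one such alternative (a sketch only; it assumes we allocate the buffer ourselves with new[] instead of PyDataMem_NEW, and it is not necessarily the approach the linked post describes), the buffer's lifetime can be tied to the array via a capsule set as the array's base object:

static void free_buffer(PyObject* capsule) {
    // called when the array (and with it the capsule) is garbage collected
    delete[] static_cast<char*>(PyCapsule_GetPointer(capsule, "np_str_buffer"));
}

PyObject* create_np_array_capsule(const std::vector<std::string>& vals, size_t itemsize){
    char* mem = new char[vals.size()*itemsize]();  // zero-initialized, i.e. already NUL-padded
    size_t cur_index = 0;
    for(const auto& val : vals)
        for(size_t i=0; i<itemsize; i++, cur_index++)
            if(i < val.size())
                mem[cur_index] = val[i];

    npy_intp dim = static_cast<npy_intp>(vals.size());
    // no NPY_ARRAY_OWNDATA: the array does not own mem itself...
    PyObject* arr = PyArray_New(&PyArray_Type, 1, &dim, NPY_STRING, NULL,
                                mem, static_cast<int>(itemsize), 0, NULL);
    // ...instead a capsule holding mem becomes the array's base object;
    // PyArray_SetBaseObject steals the reference to the capsule
    PyObject* capsule = PyCapsule_New(mem, "np_str_buffer", free_buffer);
    PyArray_SetBaseObject(reinterpret_cast<PyArrayObject*>(arr), capsule);
    return arr;
}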