C-Numpy: How to create fixed-width ndarray of strings from existing data

I'm writing a Python extension module in C++ with Boost Python. I want to return numpy arrays from the module to Python. It works well with numeric datatypes like double but at one point I need to create a string array from existing data.

For numeric arrays I used PyArray_SimpleNewFromData, which worked well. Since strings do not have a fixed length, I used PyArray_New instead, where I can pass in the itemsize (4 in my case). Here's a minimal example:

bool initNumpy()
{
    Py_Initialize();
    import_array();
    return true;
}

class Foo {
    public:            
        Foo() {
            initNumpy();
            data.reserve(10);
            data = {"Rx", "Rx", "Rx", "RxTx", "Tx", "Tx", "Tx", "RxTx", "Rx", "Tx"};                
        }

        PyObject* getArray() {
            npy_intp dims[] = { data.size() };            
            return (PyObject*)PyArray_New(&PyArray_Type, 1, dims, NPY_STRING, NULL, &data[0], 4, NPY_ARRAY_OWNDATA, NULL);
        }
    private:
        std::vector<std::string> data;             
};

I expect the output of getArray() to be equal to the output of numpy.array(["Rx", "Rx" ...], dtype="S4"), which is:

array([b'Rx', b'Rx', b'Rx', b'RxTx', b'Tx', b'Tx', b'Tx', b'RxTx', b'Rx',
       b'Tx'], dtype='|S4')

but it looks like this:

array([b'Rx', b'', b'\xcc\xb3b\xd9', b'\xfe\x07', b'\x02', b'', b'\x0f',
       b'', b'Rx\x00\x03', b''], dtype='|S4')

I tried playing around with the npy_intp const* strides argument, because I think the issue is the memory layout of the underlying data. Unfortunately, that didn't achieve the desired result.

I'm using Microsoft Build Tools 2017, Boost.Python, distutils and Python 3.7 to build the extension.

Solution

When using PyArray_New, the data passed in must already have exactly the memory layout that the numpy array expects. That is the case for simple data types such as np.float64, but not for std::vector<std::string> and dtype='|S4'.

First, what memory layout does PyArray_New expect for |S4?

Let's take as an example

array([b'Rx', b'RxTx', b'T'], dtype='|S4')

the expected memory layout would be:

| R| x|\0|\0| R| x| T| x| T|\0|\0|\0|
|           |           |           |
|- 1. word -|- 2. word -|- 3. word -|

There are a few noteworthy details:

the memory is contiguous: the elements follow one another directly, without gaps or indirection.
every element is exactly 4 bytes long. The strings are stored without a NUL terminator (see the 2. word); none is needed, because the length of every element is fixed.
if a word is shorter than 4 characters, the remaining bytes must be set to \0, i.e. the NUL character. One is out of luck if one wants to store strings with trailing \0, but that is another story.
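The layout rules above can be tried out without NumPy. The following is a minimal sketch in plain C++; pack_fixed_width and read_field are hypothetical helper names introduced only for illustration, and a std::string merely serves as a byte buffer here:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: lay out `vals` as consecutive fields of `itemsize`
// bytes each, i.e. the layout described above for dtype '|S<itemsize>'.
std::string pack_fixed_width(const std::vector<std::string>& vals,
                             std::size_t itemsize) {
    assert(itemsize > 0);
    // Every byte starts out as NUL, so short strings are padded already.
    std::string buffer(vals.size() * itemsize, '\0');
    for (std::size_t k = 0; k < vals.size(); ++k) {
        // std::string::copy writes at most `itemsize` characters and does
        // not append a terminator; longer strings are silently truncated.
        vals[k].copy(&buffer[k * itemsize], itemsize);
    }
    return buffer;
}

// Hypothetical helper: read field `k` back, treating trailing NULs as padding.
std::string read_field(const std::string& buffer, std::size_t itemsize,
                       std::size_t k) {
    std::string field = buffer.substr(k * itemsize, itemsize);
    const std::size_t last = field.find_last_not_of('\0');
    field.resize(last == std::string::npos ? 0 : last + 1);
    return field;
}
```

For {"Rx", "RxTx", "T"} and an itemsize of 4 this yields the 12 bytes from the diagram. Because trailing NULs are read as padding, a string that genuinely ends in \0 cannot be told apart from a shorter one.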

A std::vector<std::string> has a completely different memory layout. Because the memory layout of std::string is not prescribed by the C++ standard, it can also differ from implementation to implementation.
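To see the mismatch, one can measure how far apart two neighbouring elements of the vector are. This is a small sketch; element_stride is a hypothetical helper, and the concrete value of sizeof(std::string) depends on the standard library (a few dozen bytes on common 64-bit implementations, which is an observation and not a guarantee):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: distance in bytes between two consecutive elements.
// The vector stores std::string *objects* back to back, so this is
// sizeof(std::string) and has nothing to do with the length of the text.
std::size_t element_stride(const std::vector<std::string>& v) {
    assert(v.size() >= 2);
    const char* first = reinterpret_cast<const char*>(&v[0]);
    const char* second = reinterpret_cast<const char*>(&v[1]);
    return static_cast<std::size_t>(second - first);
}
```

With an itemsize of 4, numpy reads the first 4 bytes of the first string object as element 0, the next 4 bytes as element 1, and so on, so most elements show internal bookkeeping of std::string rather than characters. In the output from the question, b'Rx' shows up again at element 8, i.e. at byte offset 32, which would be consistent with a 32-byte std::string that keeps short strings inline at its start; this is an inference, not something the standard promises.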

The consequence of the above observations is that there is no way around copying the data if the strings are given as a std::vector<std::string>. The function consists of three steps:

1. allocate memory
2. copy the strings to the new location
3. create the numpy array from the memory constructed above

Below is an example implementation for C++11, in which error handling is left as an exercise for the reader:

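PyObject* create_np_array(const std::vector<std::string>& vals, size_t itemsize){

    //1. step allocate memory
    size_t mem_size = vals.size()*itemsize;
    void * mem = PyDataMem_NEW(mem_size);
    //ToDo: check mem!=nullptr
    //ToDo: make code exception safe

    //2. step initialize memory/copy data:
    char * out = static_cast<char*>(mem);
    size_t cur_index=0;
    for(const auto& val : vals){
        for(size_t i=0;i<itemsize;i++){
            //fill with NUL if string too short
            out[cur_index++] = i<val.size() ? val[i] : 0;
        }
    }

    //3. create numpy array (pass itemsize instead of a hard-coded 4)
    npy_intp dim = static_cast<npy_intp>(vals.size());
    PyObject* arr = PyArray_New(&PyArray_Type, 1, &dim, NPY_STRING, NULL, mem,
                                static_cast<int>(itemsize), NPY_ARRAY_OWNDATA, NULL);
    if(arr == NULL){
        PyDataMem_FREE(mem); //creation failed, nobody else will free the buffer
        return NULL;
    }
    //NumPy's documentation states that OWNDATA is reset when a data pointer
    //is passed in, so set it explicitly (check against your NumPy version):
    PyArray_ENABLEFLAGS(reinterpret_cast<PyArrayObject*>(arr), NPY_ARRAY_OWNDATA);
    return arr;
}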
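The function above takes the itemsize from the caller. If no width is known in advance, one option is to use the length of the longest string, so that nothing gets truncated. A minimal sketch follows; required_itemsize is a hypothetical helper, and returning at least 1 is an assumption made here to avoid a zero-width element:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical helper: smallest itemsize that holds every string in `vals`
// without truncation (at least 1, so that the element width is never zero).
std::size_t required_itemsize(const std::vector<std::string>& vals) {
    std::size_t longest = 1;
    for (const auto& val : vals) {
        longest = std::max(longest, val.size());
    }
    // PyArray_New takes the itemsize as an int, so it has to fit into one.
    assert(longest <= static_cast<std::size_t>(std::numeric_limits<int>::max()));
    return longest;
}
```

For the data from the question this gives 4, matching dtype='|S4'.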

One last important point: if the memory is to be owned by the resulting numpy array (NPY_ARRAY_OWNDATA flag), it should be allocated with PyDataMem_NEW rather than malloc. This has two advantages: memory tracing works correctly, and we don't (mis)use an implementation detail. For other ways to pass ownership of the data, see this SO-post.
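Regarding the "make code exception safe" ToDo: one possible approach is to hold the freshly allocated buffer in a std::unique_ptr with a custom deleter until the array has taken ownership. The sketch below uses std::malloc and std::free purely as stand-ins so that it compiles without the NumPy headers; in the extension, buffer_alloc and buffer_free (both hypothetical names) would wrap PyDataMem_NEW and PyDataMem_FREE instead:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>

// Stand-ins (assumption): replace with PyDataMem_NEW / PyDataMem_FREE when
// the buffer is meant to be handed to a numpy array.
inline void* buffer_alloc(std::size_t n) { return std::malloc(n); }
inline void buffer_free(void* p) { std::free(p); }

// Frees the buffer automatically unless release() has been called.
using BufferGuard = std::unique_ptr<void, void (*)(void*)>;

// Allocate `n` NUL-initialised bytes; the guard is empty if allocation failed.
BufferGuard make_buffer(std::size_t n) {
    // A zero-byte request may legitimately return a null pointer, which
    // would be indistinguishable from failure, so it is ruled out here.
    assert(n > 0);
    BufferGuard guard(buffer_alloc(n), &buffer_free);
    if (guard) {
        std::memset(guard.get(), 0, n);
    }
    return guard;
}
```

The idea is to pass guard.get() to PyArray_New and to call guard.release() only once the array exists and owns the data. If anything throws or fails before that point, the guard frees the memory.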