I'm working with some data that is represented in C as strings. I'd like to return a numpy array based on this data. However, I'd like the array to have dtype='SX' where X is a number determined at runtime.
So far I am copying the data in C like so:
npy_intp buffer_len_alt;   /* dims for PyArray_SimpleNewFromData must be npy_intp */
char *output_buffer;
PyObject *column;

buffer_len_alt = (MAX_WIDTH) * (MAX_NUMBER_OF_ITEMS);
output_buffer = (char *) calloc(buffer_len_alt, sizeof(char));
column = PyArray_SimpleNewFromData(1, &buffer_len_alt, NPY_BYTE, output_buffer);
if (column == NULL) {
    return NULL;
}
/* Put strings of length MAX_WIDTH in output_buffer */
return column;
As you can see, I am telling PyArray_SimpleNewFromData that 'column' is a 1D array of bytes, so when the pointer we called 'column' becomes the Python object 'col' we see this:
print(col)
>> array([48, 0, 0, 50, 48, 48, 48, 0, 0, 50, 48, 48, 50, 48, 48, 48, 0, 0], dtype=int8)
print(col.view('S3'))
>> array([b'0', b'200', b'0', b'200', b'200', b'0'], dtype='|S3')
The 'b' prefix tells me these are still interpreted as bytes objects, but I want the strings "0", "200", etc. instead. In this example the strings happen to be digits, but that is not always the case.
I know I can call b'200'.decode(format) on each individual bytes object to turn it into a string, but the whole point of writing a C extension for numpy was to get all the loops running in C. The old chararray interface (now deprecated?) also provided an array.decode method that would decode every element of an array, but again, the objects returned by the numpy C API are just plain ndarrays.
Question: What typenum should I pass to SimpleNewFromData instead of NPY_BYTE so that Python receives the array with the correct type information (e.g. dtype='S5')?
Alternatively, if no typenum achieves this with SimpleNewFromData, then perhaps I need to use SimpleNewFromDescr, but I don't know how to set the PyArray_Descr parameters correctly, and the documentation is really spotty on this, so I'd greatly appreciate any form of guidance.
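For concreteness, here is the rough direction I have pieced together from the headers, reusing MAX_WIDTH, MAX_NUMBER_OF_ITEMS, output_buffer and column from the snippet above. It is only a sketch (two alternatives shown back to back), and the descriptor handling is exactly the part I am unsure about: NPY_STRING looks like the right typenum for fixed-width byte strings, but SimpleNewFromData has no itemsize argument, so I am guessing I need either PyArray_New (which takes an itemsize) or PyArray_NewFromDescr with a descriptor whose elsize has been set.
npy_intp n_items = MAX_NUMBER_OF_ITEMS;
PyArray_Descr *descr;

/* Sketch A: PyArray_New takes an explicit itemsize, which should give dtype 'S<MAX_WIDTH>'. */
column = PyArray_New(&PyArray_Type, 1, &n_items, NPY_STRING,
                     NULL,                /* strides: contiguous */
                     output_buffer,       /* existing data buffer */
                     MAX_WIDTH,           /* itemsize */
                     NPY_ARRAY_CARRAY,    /* same flags SimpleNewFromData uses */
                     NULL);

/* Sketch B: build a flexible-string descriptor by hand and pass it to
   PyArray_NewFromDescr, which steals the reference to descr. */
descr = PyArray_DescrNewFromType(NPY_STRING);
if (descr == NULL) {
    return NULL;
}
descr->elsize = MAX_WIDTH;   /* per-element width known only at runtime */
column = PyArray_NewFromDescr(&PyArray_Type, descr, 1, &n_items,
                              NULL, output_buffer,
                              NPY_ARRAY_CARRAY, NULL);
The runtime-determined width would go in place of MAX_WIDTH in either case, if either of these is in fact the intended way to do it.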
I'm not familiar with the C part of your code, but it appears that you are confusing the representation of byte strings and unicode strings. The b'200' display indicates that you are working in Py3, where unicode is the default string type.
In a Py3 session:
The raw bytes:
In [482]: x=np.array([48, 0, 0, 50, 48, 48, 48, 0, 0, 50, 48, 48, 50, 48, 48, 48, 0, 0], dtype=np.int8)
The same bytes viewed as 3-byte strings. In a Py2 session the b prefix would not be shown, but the view is the same:
In [483]: x.view('S3')
Out[483]:
array([b'0', b'200', b'0', b'200', b'200', b'0'],
dtype='|S3')
A view does not change the data buffer, but astype can convert the elements as needed and make a new array with a new data buffer:
In [484]: x.view('S3').astype('U3')
Out[484]:
array(['0', '200', '0', '200', '200', '0'],
dtype='<U3')
In [485]: x.view('S3').astype('U3').view(np.uint8)
Out[485]:
array([48, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 50, 0, 0, 0, 48,
0, 0, 0, 48, 0, 0, 0, 48, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 50, 0, 0, 0, 48, 0, 0, 0, 48, 0, 0, 0, 50, 0, 0,
0, 48, 0, 0, 0, 48, 0, 0, 0, 48, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0], dtype=uint8)
The unicode version has 72 bytes in its buffer, 4 bytes per character: 6 elements * 3 characters * 4 bytes each = 72 bytes.
np.char is still around, but mostly to apply string methods to S and U type arrays. np.char.decode does the same thing as the astype conversion:
In [489]: np.char.decode(x.view('S3'))
Out[489]:
array(['0', '200', '0', '200', '200', '0'],
dtype='<U3')