Canonical way to convert an array of strings in C to a Python list using Cython


I'm using Cython to interface a C library with Python. A library function returns an array of null-terminated strings with type char** and I want to convert this to a Python list of str. The following code works, but it seems fragile and clunky and I wonder if there is a simpler way to do it:

# myfile.pyx

from cython.operator import dereference

def results_from_c():
    cdef char** cstringsptr = my_c_function()

    strings = []

    string = dereference(cstringsptr)
    while string != NULL:
        strings.append(string.decode())
        cstringsptr += 1
        string = dereference(cstringsptr)

    return strings

In particular, is it OK to get the next string in the array with cstringsptr += 1, as one would do in C with e.g. cstringsptr++;? Is this, in general, a robust way to convert arrays to lists? What happens if, for example, memory allocation fails, or the string is not null terminated and it loops forever? It seems to me that there should be a simpler way to do this with Cython.


Solution

  • To complement @alexis's answer: in terms of performance, using append is relatively slow (the list is backed by a dynamic array that has to be regrown as it fills up), and it can be replaced by direct indexing into a preallocated list. The idea is to walk the array twice: a first pass counts the strings, and a second pass fills the list. Two passes may seem expensive, but this should not be the case, since the first one is a trivial pointer scan that the compiler should optimize well. If the code is compiled with the highest optimization level (-O3), the compiler may even be able to use fast SIMD instructions for it. Once the length is known, the list can be allocated once and filled much faster. String decoding will probably account for a significant part of the run time. UTF-8 decoding is used by default, which is somewhat expensive; ASCII decoding should be a bit faster, provided the strings are known to contain only ASCII characters.

    Here is an example of untested code:

    from cython.operator import dereference
    
    def results_from_c():
        cdef char** cstringsptr = my_c_function()
        cdef int length = 0
        cdef int i
    
        string = dereference(cstringsptr)
        while string != NULL:
            cstringsptr += 1
            length += 1
            string = dereference(cstringsptr)
    
        cstringsptr -= length
    
        # Preallocate the list with None placeholders; every slot is overwritten below
        strings = [None] * length
    
        for i in range(length):
            string = dereference(cstringsptr + i)
            strings[i] = string.decode()
    
        return strings
    

    This makes the code more complex though.
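
For readers who want to try the count-then-fill idea without compiling anything, here is a minimal sketch in plain Python using ctypes rather than Cython. It is not the code from the answer: make_c_string_array is a hypothetical stand-in for my_c_function, and strings_from_pointer and its encoding parameter are names introduced only for this illustration. It assumes the array is terminated by a NULL pointer, as in the question.

```python
import ctypes


def make_c_string_array(values):
    """Hypothetical stand-in for my_c_function: a NULL-terminated array of C strings."""
    array_type = ctypes.c_char_p * (len(values) + 1)
    # The trailing None is stored as a NULL pointer and terminates the array.
    return array_type(*[v.encode("ascii") for v in values], None)


def strings_from_pointer(ptr, encoding="ascii"):
    """Convert a NULL-terminated char** into a list of str in two passes."""
    # First pass: count the entries before the NULL terminator.
    # Indexing a POINTER(c_char_p) yields bytes, or None for a NULL pointer.
    length = 0
    while ptr[length] is not None:
        length += 1

    # Second pass: fill a preallocated list by index instead of appending.
    strings = [None] * length
    for i in range(length):
        strings[i] = ptr[i].decode(encoding)
    return strings


# Keep a reference to the array so the underlying memory stays alive.
array = make_c_string_array(["alpha", "beta", "gamma"])
pointer = ctypes.cast(array, ctypes.POINTER(ctypes.c_char_p))
result = strings_from_pointer(pointer)
```

Passing encoding="utf-8" accepts non-ASCII data; with "ascii", any byte above 0x7F raises UnicodeDecodeError. As in the Cython versions, the pointer walk has no bounds check, so it relies entirely on the NULL terminator being present.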