Python C API unicode arguments


I have a simple Python script:

import _tph
str = u'Привет, <b>мир!</b>' # A Unicode string with Russian characters
_tph.strip_tags(str)

and a C library compiled into _tph.so. This is its strip_tags function:

PyObject *strip_tags(PyObject *self, PyObject *args) {
    PyUnicodeObject *string;
    Py_ssize_t length;

    PyArg_ParseTuple(args, "u#", &string, &length);
    printf("%d, %d\n", string->length, length);

    // ...
}

printf prints this: 1080, 19. The length of str really is 19 characters, so where on earth do those 1080 characters come from?

When I print string, I get my str, a null character, and then a lot of junk bytes.

Junk memory looks like this:

u'\u041f\u0440\u0438\u0432\u0435\u0442, <b>\u043c\u0438\u0440!</b>\x00\x00\u0299\Ub7024000\U08c55800\Ub7025904\x00\Ub777351c\U08c79e58\x00\U08c7a0b4\x00\Ub7025904\Ub7025954\Ub702594c\Ub702591c\Ub702592c\Ub7025934\x00\x00\x00

How can I get a normal string here?


Solution

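  • The "string" argument is declared with the wrong type. With the "u#" format, PyArg_ParseTuple stores a pointer to the string's Py_UNICODE character buffer plus its length; it does not store a pointer to the Python Unicode object. Dereferencing it as a PyUnicodeObject therefore reads character data as if it were an object header: string->length lands a couple of words into the buffer (1080 is 0x438, the code point of "и", the third character, which is consistent with a 32-bit build using 4-byte Py_UNICODE). Only the first 19 characters reported in length belong to your string; anything printed past them is whatever happens to follow the buffer in memory.

    Declare the variable as Py_UNICODE *string and use length for the size. If you want the object itself, parse with the "U" format into a PyObject * instead; the simplest way to view it is then PyObject_Print(obj, stdout, 0). You can find the C functions for manipulating Python unicode objects at: http://docs.python.org/c-api/unicode.html#unicode-objects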
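
The 1080 in the question can be reproduced without the C extension. What follows is a minimal sketch, not the interpreter's actual code. It assumes that the pointer filled in by "u#" addresses the raw character buffer, that each character takes 4 bytes, and that the length field of the object struct sits 8 bytes in (two 4-byte header fields, as on a 32-bit build). HEADER_SIZE and the variable names are illustrative only.

```python
import struct

text = u'Привет, <b>мир!</b>'

# Stand-in for the character buffer: 4 bytes per code point, little-endian, no BOM.
buf = text.encode('utf-32-le')

# Assumed layout: two 4-byte header fields come before the length field.
HEADER_SIZE = 8

# Read a 32-bit integer where the code expected to find the length field.
(misread_length,) = struct.unpack_from('<i', buf, HEADER_SIZE)

print(len(text), misread_length)  # prints: 19 1080
```

Under those assumptions, byte offset 8 is the third character of the buffer, "и" (U+0438), which is 1080 in decimal. The 19 is the real length.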