I have a simple Python script:
import _tph
str = u'Привет, <b>мир!</b>' # A Unicode string with Russian characters
_tph.strip_tags(str)
and a C library, which is compiled into _tph.so. This is the strip_tags function from it:
PyObject *strip_tags(PyObject *self, PyObject *args) {
    PyUnicodeObject *string;
    Py_ssize_t length;
    PyArg_ParseTuple(args, "u#", &string, &length);
    printf("%d, %d\n", string->length, length);
    // ...
}
The printf call prints this: 1080, 19. So str really is 19 characters long, but where the hell am I getting those 1080 characters from? When I print string, I get my str, a null character, and then a lot of junk bytes.
Junk memory looks like this:
u'\u041f\u0440\u0438\u0432\u0435\u0442, <b>\u043c\u0438\u0440!</b>\x00\x00\u0299\Ub7024000\U08c55800\Ub7025904\x00\Ub777351c\U08c79e58\x00\U08c7a0b4\x00\Ub7025904\Ub7025954\Ub702594c\Ub702591c\Ub702592c\Ub7025934\x00\x00\x00
How can I get a normal string here?
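The "string" variable here isn't well named, and it isn't well typed either. With the "u#" format, PyArg_ParseTuple stores a pointer to the raw Py_UNICODE character buffer together with its length, not a pointer to a Python Unicode object. Because string is declared as PyUnicodeObject *, string->length reads part of the character data as if it were an object header, which is where the 1080 comes from. Printing 1080 characters then runs past the real 19 characters into whatever memory follows the buffer.
To fix the code as written, declare string as Py_UNICODE * and use the length that PyArg_ParseTuple gives you. If you want the object itself, parse with the "U" format into a PyObject * instead; the simplest way to view it is then PyObject_Print(obj, stdout, 0). You can find the C functions for manipulating Python Unicode objects at: http://docs.python.org/c-api/unicode.html#unicode-objects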
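To see where a value like 1080 could come from, here is a minimal sketch in plain C that does not use the Python headers. It assumes a 32-bit, UCS-4 build of Python 2 (a guess based on the 8-digit \U escapes and the 32-bit-looking addresses in the dump), where the object header is two 4-byte fields followed by a 4-byte length. FakeUnicodeHeader and misread_length are hypothetical names used only for illustration; they are not part of the Python C API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the start of PyUnicodeObject on the assumed
   32-bit build: two 4-byte header fields, then a 4-byte length. */
typedef struct {
    uint32_t ob_refcnt;
    uint32_t ob_type;   /* a pointer on the real build, assumed 4 bytes wide */
    uint32_t length;
} FakeUnicodeHeader;

/* Read the "length" field out of memory that really holds UCS-4 code
   points, which is what string->length does when string actually points
   at the character buffer. The buffer must hold at least 3 code points. */
uint32_t misread_length(const uint32_t *code_points)
{
    uint32_t length;

    assert(code_points != NULL);
    /* Copy the bytes at the offset where the header keeps its length. */
    memcpy(&length,
           (const unsigned char *)code_points
               + offsetof(FakeUnicodeHeader, length),
           sizeof length);
    return length;
}
```

Under those assumptions, the length field sits 8 bytes into the header, so the call returns the third code point of the buffer. For 'Привет' (U+041F, U+0440, U+0438, ...) that is 0x0438, the letter 'и', which is 1080 in decimal and matches the number in the question.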