Tags: python, google-app-engine, unicode, utf-16, utf-32

How to get a reliable unicode character count in Python?


Google App Engine uses Python 2.5.2, apparently with UCS4 enabled. But the GAE datastore uses UTF-8 internally. So if you store u'\ud834\udd0c' (length 2) to the datastore, when you retrieve it, you get u'\U0001d10c' (length 1). I'm trying to count the number of unicode characters in the string in a way that gives the same result before and after storing it. So I'm trying to normalize the string (from u'\ud834\udd0c' to u'\U0001d10c') as soon as I receive it, before calculating its length and putting it in the datastore. I know I can just encode it to UTF-8 and then decode again, but is there a more straightforward/efficient way?
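
To make the mismatch concrete, here is a minimal sketch of the two forms. It assumes a wide (UCS-4) build, or Python 3.3+, where len() counts code points; the variable names are only for illustration.

```python
# Assumes a wide (UCS-4) Python 2 build or Python 3.3+,
# where len() counts code points rather than UTF-16 code units.
pair = u'\ud834\udd0c'    # what I receive: two UTF-16 surrogate code units
single = u'\U0001d10c'    # what the datastore returns: one code point

print(len(pair))    # 2
print(len(single))  # 1
```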


Solution

    > I know I can just encode it to UTF-8 and then decode again

    Yes, that's the usual idiom to fix up the problem when you have “UTF-16 surrogates in UCS-4 string” input. But as Mechanical snail said, this input is malformed and you should be fixing whatever produced it in preference.

    > is there a more straightforward/efficient way?

    Well... you could do it manually with a regex, like:

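    import re

    # Replace each high/low surrogate pair with the single code point it encodes
    # (needs a wide/UCS-4 build so that unichr() accepts values above 0xFFFF).
    re.sub(
        u'([\uD800-\uDBFF])([\uDC00-\uDFFF])',
        lambda m: unichr(((ord(m.group(1)) - 0xD800) << 10) + (ord(m.group(2)) - 0xDC00) + 0x10000),
        s
    )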
    

    Certainly not more straightforward... I also have my doubts as to whether it's actually more efficient!
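
For comparison, the encode/decode round trip from the question can be wrapped in a small helper. This is only a sketch: the name `normalize_surrogates` is made up for illustration, and the Python 2 branch assumes a wide (UCS-4) build as on App Engine. The Python 3 branch is an assumption added for newer interpreters, where the UTF-8 codec rejects surrogates, so it goes through UTF-16 with the `surrogatepass` error handler instead.

```python
import sys

def normalize_surrogates(s):
    """Merge UTF-16 surrogate pairs in a unicode string into single code points."""
    if sys.version_info[0] >= 3:
        # Python 3 (an assumption, not covered by the question): the UTF-8 codec
        # refuses surrogates, so round-trip through UTF-16 with 'surrogatepass'.
        return s.encode('utf-16', 'surrogatepass').decode('utf-16', 'surrogatepass')
    # Python 2 wide build: the UTF-8 encoder combines a surrogate pair into one
    # 4-byte sequence, which decodes back to a single code point.
    return s.encode('utf-8').decode('utf-8')

normalized = normalize_surrogates(u'\ud834\udd0c')
print(len(normalized))
```

Calling the helper as soon as the string arrives, before taking len() and before writing to the datastore, should give the same count on both sides of the store.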