Ubuntu 11.10:
$ python
Python 2.7.2+ (default, Oct 4 2011, 20:03:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = u'\U0001f44d'
>>> len(x)
1
>>> ord(x[0])
128077
Windows 7:
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = u'\U0001f44d'
>>> len(x)
2
>>> ord(x[0])
55357
My Ubuntu experience is with the default interpreter in the distribution. For Windows 7 I downloaded and installed the recommended version linked from python.org. I did not compile either of them myself.
The nature of the difference is clear to me. (On Ubuntu the string is a sequence of code points; on Windows 7 a sequence of UTF-16 code units.) My questions are:
unicode
strings on Windows only (blech)?On Ubuntu, you have a "wide" Python build where strings are UTF-32/UCS-4. Unfortunately, this isn't (yet) available for Windows.
Windows builds will be narrow for a while based on the fact that there have been few requests for wide characters, those requests are mostly from hard-core programmers with the ability to buy their own Python and Windows itself is strongly biased towards 16-bit characters.
Python 3.3 will have flexible string representation, in which you will not need to care about whether Unicode strings use 16-bit or 32-bit code units.
Until then, you can get the code points from a UTF-16 string with
def code_points(text):
utf32 = text.encode('UTF-32LE')
return struct.unpack('<{}I'.format(len(utf32) // 4), utf32)