I'm using the awesome regex module and trying out its \X grapheme support.
First, I tried the plain old . pattern:
>>> print regex.match('.', 'Ä').group(0)
>>> print regex.match('..', 'Ä').group(0)
Ä
That went as expected. Moving on to \X:
>>> print regex.match('\X', 'Ä').group(0)
>>> print regex.match('\X\X', 'Ä').group(0)
Ä
Why does \X behave the same as . here? Shouldn't a single \X be enough to capture the A-umlaut? Is it that \X is wrong?
It works if you define the Ä as a Unicode string (note the u prefix):
>>> print regex.match('.', u'Ä').group()
Ä
>>> print regex.match('\X', u'Ä').group()
Ä
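To see why the byte string needed two dots, here is a minimal sketch. It makes two assumptions that are not in the original post: it uses the standard-library re module rather than regex (so it has no third-party dependency), and it is written for Python 3, where the split between bytes and text is explicit. The variable names are only illustrative.

```python
import re

text = u'\u00c4'              # 'Ä' as a single Unicode code point
data = text.encode('utf-8')   # the same character as UTF-8: b'\xc3\x84'

# With a bytes pattern, each '.' consumes one byte,
# so a single dot only captures the first half of the character.
one_byte = re.match(b'.', data).group(0)
two_bytes = re.match(b'..', data).group(0)

# With a text pattern, each '.' consumes one code point.
one_char = re.match(u'.', text).group(0)

print(len(data), len(text))
print(repr(one_byte), repr(two_bytes), repr(one_char))
```

The plain 'Ä' literal in the Python 2 session plays the role of data here, and u'Ä' plays the role of text.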
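The main difference between Python 2 and Python 3 is the set of basic types for dealing with text and bytes. Python 3 has one text type, str, which holds Unicode data, and two byte types, bytes and bytearray.
Python 2, on the other hand, has two text types: str, which for all intents and purposes is limited to ASCII plus some undefined data above the 7-bit range, and unicode, which is equivalent to the Python 3 str type. It also has one byte type, bytearray, which it inherited from Python 3.
So in the question, the plain 'Ä' literal is a Python 2 str holding the encoded bytes of the character (two bytes, assuming a UTF-8 terminal), and both . and \X match one byte at a time. The u'Ä' literal is a unicode object, so a single . or \X matches the whole character.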
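As a rough illustration of these types, the following sketch (assuming Python 3 and UTF-8 as the encoding, neither of which comes from the original post) builds the same character as str, bytes, and bytearray and converts between them:

```python
text = '\u00c4'               # str: Unicode text, one code point ('Ä')
raw = text.encode('utf-8')    # bytes: immutable sequence of byte values
buf = bytearray(raw)          # bytearray: mutable counterpart of bytes

# Decoding the bytes gives back the original text.
decoded = raw.decode('utf-8')

for value in (text, raw, buf):
    print(type(value).__name__, len(value))
```

In Python 2 the closest equivalents are u'Ä' for text and 'Ä' for raw, which is why adding the u prefix changes what the pattern matches.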
Reference - https://docs.python.org/2/howto/unicode.html#python-2-x-s-unicode-support