
Why does the regex "[a-z]" match against the non-ASCII characters "İıſK" when the case-insensitive flag is used?


The following Python code (version 3.11.0) gives an unexpected result:

import re
import sys

s = ''.join(map(chr, range(sys.maxunicode + 1)))
matches = ''.join(re.findall('[a-z]', s, re.IGNORECASE))
print(matches)

It prints the extra 4 non-ASCII characters 'İıſK':

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzİıſK

This is actually documented, but without any explanation as to why it behaves like this:

Note that when the Unicode patterns [a-z] or [A-Z] are used in combination with the IGNORECASE flag, they will match the 52 ASCII letters and 4 additional non-ASCII letters: ‘İ’ (U+0130, Latin capital letter I with dot above), ‘ı’ (U+0131, Latin small letter dotless i), ‘ſ’ (U+017F, Latin small letter long s) and ‘K’ (U+212A, Kelvin sign). If the ASCII flag is used, only letters ‘a’ to ‘z’ and ‘A’ to ‘Z’ are matched.
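The last sentence of that note can be checked directly. The following is a minimal sketch that reruns the scan above with and without `re.ASCII`; the helper name `letters_matched` is made up for this illustration and is not part of the `re` module.

```python
import re
import sys

def letters_matched(flags):
    """Return every code point that '[a-z]' matches under the given flags."""
    s = ''.join(map(chr, range(sys.maxunicode + 1)))
    return ''.join(re.findall('[a-z]', s, flags))

# Default (Unicode) case-insensitive matching for str patterns.
unicode_matches = letters_matched(re.IGNORECASE)
# ASCII-only case-insensitive matching.
ascii_matches = letters_matched(re.IGNORECASE | re.ASCII)

# The characters that only show up without re.ASCII.
extras = sorted(set(unicode_matches) - set(ascii_matches))
print(len(ascii_matches), [f'U+{ord(c):04X}' for c in extras])
```

So if only the 52 ASCII letters are wanted, adding `re.ASCII` (or the inline `(?a)` flag) to the pattern should be enough.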

I could maybe understand matching against the Kelvin sign, but the others make no sense to me. Is this just a bug or is there a deeper reason why it should behave like this?


Solution

  • Those characters are considered (at least in some situations/locales) to be lower-/upper-case variants of the "traditional" ASCII a-z characters:

    (See the "Uppercase Character" and "Lowercase Character" entries on these pages, which are directly taken from the Unicode data set).

    Why are these "non-default" characters marked this way? Because in some sense, or in some locales, they really are valid case variants of ASCII letters. For example, because the dotless ı exists (as in Turkish), both dotted and dotless variants of I exist in upper and lower case, which frequently causes problems in software. Similarly, if you had a text that contained a ſ and you wanted to convert it to upper case, then S would be the most appropriate candidate.
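To see these relationships from Python itself, here is a small sketch that prints the Unicode name and the case conversions of the four characters. The helper `case_info` is a hypothetical name used only for this example. Note that `str.lower()` and `str.upper()` use the full case mappings, while `re` applies its own per-character rules, so this only approximates what the regex engine does internally.

```python
import unicodedata

# The four non-ASCII characters listed in the documentation note.
EXTRA = '\u0130\u0131\u017f\u212a'

def case_info(ch):
    """Return (Unicode name, lowercased form, uppercased form) for one character."""
    return unicodedata.name(ch), ch.lower(), ch.upper()

for ch in EXTRA:
    name, lower, upper = case_info(ch)
    print(f'U+{ord(ch):04X} {name}: lower={lower!r} upper={upper!r}')
```

Each of the four converts to (or begins with) a plain ASCII letter in at least one direction: ı and ſ uppercase to I and S, the Kelvin sign lowercases to k, and İ lowercases to an i followed by a combining dot above.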