
Why isn't locale.strxfrm("Gè") a prefix of locale.strxfrm("Gène") with locale "fr_FR.UTF-8"?

The code here is in Python, but the behavior should be the same in C/C++ using locale.

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
>>> locale.strxfrm("Gène").startswith(locale.strxfrm("Gè"))

I know it is not supposed to be used that way, but I'm wondering what is going on...

I have an array of strxfrm-transformed strings and a normal input text. I want to know which of the transformed strings started with that text before transformation. Is it doable at all? How?

Bonus Question:

Can we get the per-locale list of equivalent letters ? Can we check for equivalent strings ?

What I mean is:
In "de_DE.UTF8", can I get something like


returning True ?

Since "ß" and "ss" are equivalent in sorting (unless it's the only difference):

>>> locale.strxfrm("Wiessen") < locale.strxfrm("Wießen") < locale.strxfrm("Wiessen0")

Same for "œ" and "oe" in French.

EDIT: Regarding the bonus, I saw "Python locale-aware string comparison", but the answer there relies on third-party libraries, so I came up with a hacked workaround function:

def isEquivalent(str1, str2):
    # Either string may sort first, so test both orderings and accept either one.
    return ( locale.strxfrm(str2[:-1]) < locale.strxfrm(str1) <= locale.strxfrm(str2) < locale.strxfrm(str1+"0")
             or
             locale.strxfrm(str1[:-1]) < locale.strxfrm(str2) <= locale.strxfrm(str1) < locale.strxfrm(str2+"0") )


  • A very interesting question! This answer is not canonical; I think glibc-dev would be the best forum for that.


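    The only requirement for strxfrm is that comparing two transformed strings gives the same ordering as strcoll on the originals (the two results agree in sign, not necessarily in value):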

    strcmp(strxfrm(a), strxfrm(b)) == strcoll(a, b)
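A minimal sketch of how that requirement could be spot-checked from Python, assuming nothing beyond the standard `locale` module. The helper names `sign` and `order_matches` and the sample words are hypothetical, and the check runs against whichever collation locale is currently active:

```python
import locale
from itertools import product

def sign(n):
    """Collapse any integer to -1, 0 or 1."""
    return (n > 0) - (n < 0)

def order_matches(a, b):
    """True if plain comparison of the strxfrm keys agrees in sign with strcoll(a, b)."""
    ka, kb = locale.strxfrm(a), locale.strxfrm(b)
    key_order = (ka > kb) - (ka < kb)  # plays the role of strcmp on the keys
    return key_order == sign(locale.strcoll(a, b))

# Illustrative sample only; call locale.setlocale(...) first to test
# a specific locale instead of the default one.
words = ["Gene", "Ge", "gene", "Wiessen", "Wiessen0"]
all_consistent = all(order_matches(a, b) for a, b in product(words, repeat=2))
print(all_consistent)
```

Note that nothing in this contract says anything about prefixes: it only constrains how two complete keys compare.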

    What strxfrm allows is to export the relative order of things to another (dumber) system, for example, to maintain a secondary index in a database table.
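As a rough illustration of that "export the order" idea, here is a sketch that precomputes keys once and then relies only on plain comparison of those keys. The function name `build_index` and the sample names are made up for the example, and a real database index would of course look different:

```python
import locale
from functools import cmp_to_key

def build_index(names):
    """Return (sort_key, name) pairs ordered by plain comparison of the keys."""
    return sorted((locale.strxfrm(name), name) for name in names)

names = ["Zoe", "Gene", "Ge", "Anna"]
index = build_index(names)

# A consumer that knows nothing about locales can read the names in
# collation order simply by walking the precomputed index.
ordered = [name for _key, name in index]

# Cross-check against sorting directly with strcoll.
ordered_by_strcoll = sorted(names, key=cmp_to_key(locale.strcoll))
print(ordered == ordered_by_strcoll)
```

The keys are only meaningful for the locale that was active when they were computed, so an index like this would need rebuilding if the locale changes.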

    Let's test it

    Let's examine Python3 (Python3.9, OSX, composed normal form):

    >>> locale.strxfrm(unicodedata.normalize("NFC", "Gène"))
    >>> locale.strxfrm(unicodedata.normalize("NFC", "Gè"))

    If you were to split each output at the <SOH> byte (\x01), the pieces for "Gè" are actually valid substrings of the corresponding pieces for "Gène".
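Based on that observation, a prefix test could be sketched as below. This is an assumption built on the output format seen on this one platform, not something strxfrm guarantees, and the helper names `collation_levels` and `first_level_startswith` are hypothetical. The keys used here are made up to mimic the layout, and the input text would have to be passed through strxfrm as well before comparing:

```python
SEPARATOR = "\x01"  # the SOH character observed between the pieces

def collation_levels(transformed):
    """Split a transformed key into the pieces separated by SOH (0x01)."""
    return transformed.split(SEPARATOR)

def first_level_startswith(transformed, transformed_prefix):
    """Prefix test on the piece before the first separator only."""
    return collation_levels(transformed)[0].startswith(
        collation_levels(transformed_prefix)[0]
    )

# Made-up keys that merely mimic the "pieces around SOH" layout;
# real strxfrm output differs by platform and locale.
full_key = "abcd" + SEPARATOR + "wxyz"
short_key = "ab" + SEPARATOR + "wx"

print(full_key.startswith(short_key))               # False: the separator gets in the way
print(first_level_startswith(full_key, short_key))  # True: compares "abcd" with "ab"
```

Comparing only the first piece ignores whatever the later pieces encode, so this is a looser match than an exact prefix test on the original text.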

    I don't know the significance of the output being essentially repeated on both sides of the separator character. 🤔

    Python 3 NFD appears to follow the same semantics, but with different output, which I guess only underlines how important it is to normalise your text 😼

    >>> locale.strxfrm(unicodedata.normalize("NFD", "Gène"))
    >>> locale.strxfrm(unicodedata.normalize("NFD", "Gè"))
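To make the NFC/NFD difference concrete without depending on any locale, here is a small sketch using only `unicodedata`. It shows that the two forms of the same word are different code point sequences, which is why anything fed to strxfrm should be normalised consistently first:

```python
import unicodedata

word = "G\u00e8ne"  # "Gène" written with a precomposed è (U+00E8)
nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)

# NFC keeps è as one code point; NFD splits it into "e" + U+0300.
print(len(nfc), len(nfd))                        # 4 5
print(nfc == nfd)                                # False
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```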

    Other scripts have funkier output, here's Japanese in Japanese locale:

    >>> locale.strxfrm(unicodedata.normalize("NFC", "村上  春樹"))
    >>> locale.strxfrm(unicodedata.normalize("NFC", "村上春樹"))
    >>> locale.strxfrm(unicodedata.normalize("NFC", "村上"))
    >>> 'ăăăă\x01桔伍木欼' > 'ăă#ăă\x01桔伍#木欼' > 'ăă\x01桔伍'

    Python2 has a different format where the content is also repeated, but it's unclear how to detect the separator. So, let's not use Python 2, it's already EOL 😅

    >>> locale.strxfrm(unicodedata.normalize("NFC", u"Gène").encode("utf-8"))
    >>> locale.strxfrm(unicodedata.normalize("NFC", u"Gè").encode("utf-8"))

    JavaScript has the Intl module, which provides collation (ordering) via new Intl.Collator(...).compare(), but as far as I know it does not expose an equivalent of strxfrm. I wonder if there's some fundamental difficulty with that. I wish such a function were available to build, e.g., custom IndexedDB indices, but alas! 🤷‍♂️