Search code examples
pythonlocalestrxfrm

Wrong Hungarian (hu_HU) sort order


import locale
locale.setlocale(locale.LC_COLLATE,'hu_HU.ISO8859-2')
print(sorted(['c', 'á', 'b', 'z', 'é', 'a', 'd', 'e', 'f', 'Ő', 'Ű', 'ő', 'ű'], key=locale.strxfrm))

Expected: ['a', 'á', 'b', 'c', 'd', 'e', 'é', 'f', 'Ő', 'ő', 'Ű', 'ű', 'z']

Actual: ['a', 'á', 'b', 'c', 'd', 'e', 'é', 'f', 'z', 'Ő', 'ő', 'Ű', 'ű']

Note that 'z' is supposed to be the last letter.

I know my code "works" because it does put the 'á' and 'é' in the right place (and the "regular" sorted puts them after 'z'), so the bug I is in the locale definition.

I have MacOS Venture 13.6.4, python 3.11.7

How could I "update" the locale definitions? Is it something in the python or does it use the system locales?

Note: I tried to install PyICU and zope.ucol, but both failed during the installation, so don't tell me to use them.


Solution

  • Language sensitive sorting is a complex issue in Python:

    There are three approaches:

    1. Use PyICU
    2. Use a locale based sort key
    3. Create your own function to process the sort keys

    PyICU:

    PyICU is the best cross platform solution, if your python code needs to work on more than one operating system.

    import icu
    collator = icu.Collator.createInstance(icu.Locale('hu_HU'))
    data = ['c', 'á', 'b', 'z', 'é', 'a', 'd', 'e', 'f', 'Ő', 'Ű', 'ő', 'ű']
    print(sorted(data, key=collator.getSortKey))
    # ['a', 'á', 'b', 'c', 'd', 'e', 'é', 'f', 'ő', 'Ő', 'ű', 'Ű', 'z']
    

    But the OP would prefer to steer away form this solution.

    Locale:

    Results may vary across platforms and libc implementations, depending on locale used.

    The basic code is:

    import locale
    locale.setlocale(locale.LC_COLLATE,'hu_HU.UTF-8')
    data = ['c', 'á', 'b', 'z', 'é', 'a', 'd', 'e', 'f', 'Ő', 'Ű', 'ő', 'ű']
    print(sorted(data, key=locale.strxfrm))
    

    Results vary. macOS does not support Hungarian collation. Linux distributions that do not use glibc will not support Hungarian collation. I don't include Windows in the table below, but it will work correctly.

    Alpine: ['a', 'b', 'c', 'd', 'e', 'f', 'z', 'á', 'é', 'Ő', 'ő', 'Ű', 'ű']
    Freebsd: ['a', 'á', 'b', 'c', 'd', 'e', 'é', 'f', 'ő', 'Ő', 'ű', 'Ű', 'z']
    macOS: ['a', 'b', 'c', 'd', 'e', 'f', 'z', 'á', 'é', 'Ő', 'ő', 'Ű', 'ű']
    Ubuntu: ['a', 'á', 'b', 'c', 'd', 'e', 'é', 'f', 'ő', 'Ő', 'ű', 'Ű', 'z']
    

    Custom sort:

    An alternative approach is to wrote a custom function for using an accent and case insensitive sort key:

    import unicodedata as ud
    import regex
    data = ['c', 'á', 'b', 'z', 'é', 'a', 'd', 'e', 'f', 'Ő', 'Ű', 'ő', 'ű']
    # accent and case insensitive
    def ai_si(x):
        xn = ud.normalize("NFD", x.casefold())
        xn = regex.sub(r'\p{Mn}', '', xn)
        return (xn, x.swapcase())
    print(sorted(data, key=ai_si))
    # ['a', 'á', 'b', 'c', 'd', 'e', 'é', 'f', 'ő', 'Ő', 'ű', 'Ű', 'z']
    

    ai_si() returns a tuple (xn, x.swapcase()). If just xn was returned, casing and diacritics would be completely ignored, which means that their position in the final sort is relative to their position in the list. I.e. á, Á, a, and A would not have any fixed order relative to each other, it would be determined by position in the original list being sorted.

    We could use either x or x.swapcase() as the second element of the tuple. x would create a sort order where uppercase characters are usually (but not always) sorted before lowercase characters. While x.swapcase() reverse the relative order between cases. There are other ways of constructing the second element of the tuple.

    ai_si() allows a secondary distinction for diacritics and case.

    Correction (2024-05-16):

    As I indicated above sorting is challenging. The third solution I offered above will only work on single letters, it will not work on strings longer than one character.

    Let me explain:

    The CLDR collation rules for Hungarian are:

    &C<cs<<<Cs<<<CS
    &D<dz<<<Dz<<<DZ
    &DZ<dzs<<<Dzs<<<DZS
    &G<gy<<<Gy<<<GY
    &L<ly<<<Ly<<<LY
    &N<ny<<<Ny<<<NY
    &S<sz<<<Sz<<<SZ
    &T<ty<<<Ty<<<TY
    &Z<zs<<<Zs<<<ZS
    &O<ö<<<Ö<<ő<<<Ő
    &U<ü<<<Ü<<ű<<<Ű
    &cs<<<ccs/cs
    &Cs<<<Ccs/cs
    &CS<<<CCS/CS
    &dz<<<ddz/dz
    &Dz<<<Ddz/dz
    &DZ<<<DDZ/DZ
    &dzs<<<ddzs/dzs
    &Dzs<<<Ddzs/dzs
    &DZS<<<DDZS/DZS
    &gy<<<ggy/gy
    &Gy<<<Ggy/gy
    &GY<<<GGY/GY
    &ly<<<lly/ly
    &Ly<<<Lly/ly
    &LY<<<LLY/LY
    &ny<<<nny/ny
    &Ny<<<Nny/ny
    &NY<<<NNY/NY
    &sz<<<ssz/sz
    &Sz<<<Ssz/sz
    &SZ<<<SSZ/SZ
    &ty<<<tty/ty
    &Ty<<<Tty/ty
    &TY<<<TTY/TY
    &zs<<<zzs/zs
    &Zs<<<Zzs/zs
    &ZS<<<ZZS/ZS
    

    These are the rules needed to override the root CLDR collation.

    The glibc rules are much more complex, but only contain the overrides needed for the standard used. See the hu_HU.UTF-8 glibc locale definition.

    But one thing they have in common is that different combining marks are treated differently. a and á have a secondary difference. So a will sort before á but áa will sort before az. So ['az', 'áa'] sorts as ['áa', 'az'].

    In the CLDR rules we have:

    &O<ö<<<Ö<<ő<<<Ő
    &U<ü<<<Ü<<ű<<<Ű
    

    Where o and ö have a primary difference, so öa will sort after oz. Likewise üa will sort after uz.

    While ő has a secondary weight compared to ö, as does ű and ü. So őa will sort before öz but after oz. ['öz', 'oz', 'őa'] will sort as ['oz', 'őa', 'öz'].

    A much more complicated function would be required to handle this, and that's not even addressing expansions in the collation rules and handling digraphs and trigraphs.

    For macOS, three workable options are available:

    1. Use PyICU
    2. use a headless glibc based Linux distro in a virtual machine, and use the locale package to help sort.
    3. Use a notebook on Google Colab which installs the Hungarian locale.

    The approach using a custom function, is useful but is impractical, and maybe unworkable, for Hungarian. Although it is useful for a caseless sort. Although for a canonical caseless sort, I'd use

    def NFKC_Casefold(s):
        return ud.normalize("NFC", ud.normalize('NFKC', s).casefold())
    
    print(sorted(data, key=lambda x: locale.strxfrm(NFKC_Casefold(x))))
    

    Python makes sorting very complicated.