Tags: python, locale, ctypes, glibc

Sorting a list of strings with a specific locale in Python


I work on an application that handles text in several languages, so for display or reporting purposes some strings need to be sorted according to the collation rules of a specific language.

Currently I have a workaround that changes the global locale settings, which is bad, and I don't want to put it into production:

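import locale

default_locale = locale.getlocale(locale.LC_COLLATE)

def sort_strings(strings, locale_=None):
    if locale_ is None:
        return sorted(strings)

    # Temporarily switch the process-wide collation locale.
    locale.setlocale(locale.LC_COLLATE, locale_)
    try:
        # Python 3 has no cmp= argument; strxfrm gives an equivalent sort key.
        return sorted(strings, key=locale.strxfrm)
    finally:
        # Restore the previous setting even if sorting raises.
        locale.setlocale(locale.LC_COLLATE, default_locale)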

The official Python locale documentation explicitly says that saving and restoring the locale is a bad idea (it is expensive and affects other threads that run before the setting is restored), but it does not suggest an alternative: http://docs.python.org/library/locale.html#background-details-hints-tips-and-caveats
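To make the caveat concrete, here is a small sketch (not from the docs, just an illustration) showing that a locale set in one thread is immediately visible in another. The names `seen` and `worker` are made up for the example.

```python
import locale
import threading

seen = []  # what the other thread observes


def worker():
    # Calling setlocale without a second argument only queries the setting.
    seen.append(locale.setlocale(locale.LC_COLLATE))


original = locale.setlocale(locale.LC_COLLATE)
locale.setlocale(locale.LC_COLLATE, "C")  # change it in the main thread

t = threading.Thread(target=worker)
t.start()
t.join()

# Put the process-wide setting back the way it was.
locale.setlocale(locale.LC_COLLATE, original)
```

The worker never called `setlocale` with a new value, yet it sees the change, which is why a save/restore pair around `sorted` can corrupt sorting that runs concurrently in other threads.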


Solution

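  • Glibc does support a locale API with explicit state: newlocale creates a locale object, and functions such as strxfrm_l take that object as an argument instead of reading the global setting. Here's a quick wrapper for that API made with ctypes.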

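    # -*- coding: utf-8 -*-
    import ctypes


    class Locale(object):
        LC_ALL_MASK = 8127  # glibc-specific value; LC_COLLATE_MASK would be 8

        def __init__(self, locale):
            self.ctx = None
            self.libc = ctypes.CDLL("libc.so.6", use_errno=True)
            # locale_t is a pointer. Without these declarations ctypes
            # treats it as a C int and truncates it on 64-bit systems.
            self.libc.newlocale.restype = ctypes.c_void_p
            self.libc.newlocale.argtypes = [
                ctypes.c_int, ctypes.c_char_p, ctypes.c_void_p]
            self.libc.strxfrm_l.restype = ctypes.c_size_t
            self.libc.strxfrm_l.argtypes = [
                ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t, ctypes.c_void_p]
            self.libc.freelocale.restype = None
            self.libc.freelocale.argtypes = [ctypes.c_void_p]
            self.ctx = self.libc.newlocale(self.LC_ALL_MASK, locale.encode(), None)
            if not self.ctx:
                raise OSError(ctypes.get_errno(), "newlocale failed for %r" % locale)

        def strxfrm(self, src):
            # Assumes the locale uses UTF-8, as in the test below.
            if not isinstance(src, bytes):
                src = src.encode("utf-8")
            # A first call with size 0 only reports the length needed.
            n = self.libc.strxfrm_l(None, src, 0, self.ctx)
            dest = ctypes.create_string_buffer(n + 1)
            self.libc.strxfrm_l(dest, src, n + 1, self.ctx)
            return dest.value

        def __del__(self):
            if self.ctx:
                self.libc.freelocale(self.ctx)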

    And a short test:

    import random

    # Both locales must be installed on the system (check with `locale -a`).
    locale1 = Locale('C')
    locale2 = Locale('mk_MK.UTF-8')

    a_list = ['а', 'б', 'в', 'ј', 'ќ', 'џ', 'ш']
    random.shuffle(a_list)

    # The C locale sorts by code point; mk_MK follows the Macedonian alphabet.
    assert sorted(a_list, key=locale1.strxfrm) == ['а', 'б', 'в', 'ш', 'ј', 'ќ', 'џ']
    assert sorted(a_list, key=locale2.strxfrm) == ['а', 'б', 'в', 'ј', 'ќ', 'џ', 'ш']
    

    What's left to do is to wrap the remaining locale functions, add support for Python unicode strings (with the wchar* functions, I guess), and pull the constants from the C include files automatically instead of hard-coding them.
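For the unicode part, a rough sketch of what the wchar variant might look like, using glibc's wcsxfrm_l. This is untested beyond the C locale and rests on a few assumptions: the class name `WideLocale` is made up, the mask value is the same hard-coded glibc constant as in the wrapper above, and wchar_t is assumed to be 32 bits wide (true on Linux/glibc). The key is returned as a tuple of integers rather than a string, because the transformed values are collation weights and need not be valid code points.

```python
import ctypes

LC_ALL_MASK = 8127  # glibc-specific, same hard-coded value as in the wrapper above

_libc = ctypes.CDLL("libc.so.6", use_errno=True)
# Declare pointer-sized types so locale_t is not truncated to a C int.
_libc.newlocale.restype = ctypes.c_void_p
_libc.newlocale.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_void_p]
_libc.freelocale.restype = None
_libc.freelocale.argtypes = [ctypes.c_void_p]
_libc.wcsxfrm_l.restype = ctypes.c_size_t
_libc.wcsxfrm_l.argtypes = [
    ctypes.c_void_p, ctypes.c_wchar_p, ctypes.c_size_t, ctypes.c_void_p]


class WideLocale(object):  # hypothetical name, for illustration only
    def __init__(self, name):
        self.ctx = _libc.newlocale(LC_ALL_MASK, name.encode("ascii"), None)
        if not self.ctx:
            raise OSError(ctypes.get_errno(), "newlocale failed for %r" % name)

    def strxfrm(self, text):
        # With size 0 the call only reports how many wide chars are needed.
        n = _libc.wcsxfrm_l(None, text, 0, self.ctx)
        # Assumption: wchar_t is 32 bits, so a uint32 array is a valid buffer.
        dest = (ctypes.c_uint32 * (n + 1))()
        _libc.wcsxfrm_l(dest, text, n + 1, self.ctx)
        return tuple(dest[:n])

    def __del__(self):
        ctx, self.ctx = getattr(self, "ctx", None), None
        if ctx:
            _libc.freelocale(ctx)
```

Usage would be the same as before, e.g. `sorted(words, key=WideLocale('mk_MK.UTF-8').strxfrm)`, with the difference that the input strings no longer have to be encoded to match the locale's charset.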