Search code examples
pythonsortingtextlocalecase-insensitive

How to sort text strings in Python case-insensitively AND deterministically


I have the standard requirement to sort a list of text strings case-insensitively. However additionally this sort needs to be deterministic in the way that two lists containing the same elements should lead to the same sorted list.

To explain the issue, consider the following two lists:

l1 = ['alfred', 'Berta', 'berta', 'carl']
l2 = ['alfred', 'berta', 'carl', 'Berta']

Obviously both lists have the same elements:

>>> set(l1) == set(l2)
True

For sorting these lists I used in a first attempt the builtin function sorted function with the str.casefold as key argument. But this leads to two different result lists:

>>> sorted(l1, key=str.casefold)
['alfred', 'Berta', 'berta', 'carl']
>>> sorted(l2, key=str.casefold)
['alfred', 'berta', 'Berta', 'carl']

However I require in both cases the same output

['alfred', 'Berta', 'berta', 'carl']

How can I achieve this?

The problem with my approach is that the key function "eats some information" of the elements to sort. So is it possible to do this with a particular key function to sorted? This function would need to retain the full case information, but still sort an upper letter next to the equivalent lower letter....


Bonus question:

How can I make this sort completely locale-independent, so that when reading the lists from the same files on machines with different country settings the final list is always sorted exactly teh same way.


Solution

  • You can use a key function that returns a tuple with the case-folded string and then the original string. This will cause the case-folded strings to be compared first, making the sort case-insensitive. Then if the case-folded strings are identical, the original strings will be compared, ensuring that the result is deterministic.

    def deterministic_casefold(s):
        return s.casefold(), s
    
    sorted(l1, key=deterministic_casefold)
    sorted(l2, key=deterministic_casefold)