Problem with Regular Expression in Python


I have a function in Python that turns a given string into a tuple key for natural ("human") sorting.

See _human_key below.

But I need to change it so that German umlauts are replaced by their plain alphabetical base letters.

Long story short, I want to get rid of Ä, Ö, Ü, ß for the sorting.

Also, the case should not be considered. A small d should have the same priority as a capital D...

For the umlauts I am using chained replace() calls, which seems a pretty awkward way to do it... :-/ I have no better idea. Any suggestions?

Also, I have not managed to rewrite this to get rid of the case sensitivity...

So far I have:

import re

def _human_key(key):
    # Replace umlauts and ß with plain ASCII letters.
    key = key.replace("Ä", "A").replace("Ö", "O").replace("Ü", "U")\
          .replace("ä", "a").replace("ö", "o").replace("ü", "u")\
          .replace("ß", "s")
    # Even indices hold text, odd indices hold the captured numbers.
    parts = re.split(r'(\d*\.\d+|\d+)', key)
    return tuple((e.swapcase() if i % 2 == 0 else float(e))
            for i, e in enumerate(parts))

Examples: I have the values

 Zabel
 Schneider
 anabel
 Arachno
 Öztürk
 de 'Hahn

which I want to sort. Currently this gives:

anabel
de 'Hahn
Arachno
Öztürk
Schneider
Zabel

because the lowercase characters are treated with priority...
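
The ordering above can be reproduced in isolation. The sketch below re-implements the key under the hypothetical name current_key and prints each name next to its key. It suggests the cause is swapcase(): lowercase input becomes uppercase, and uppercase code points sort before lowercase ones, so "DE 'hAHN" lands ahead of "aRACHNO".

```python
import re

def current_key(text):
    # Same logic as the question's _human_key (hypothetical name here):
    # fold umlauts, split out numbers, swap the case of the text parts.
    for src, dst in (("Ä", "A"), ("Ö", "O"), ("Ü", "U"),
                     ("ä", "a"), ("ö", "o"), ("ü", "u"), ("ß", "s")):
        text = text.replace(src, dst)
    parts = re.split(r'(\d*\.\d+|\d+)', text)
    return tuple(e.swapcase() if i % 2 == 0 else float(e)
                 for i, e in enumerate(parts))

names = ["Zabel", "Schneider", "anabel", "Arachno", "Öztürk", "de 'Hahn"]
for name in sorted(names, key=current_key):
    # Show the key that actually drives the comparison.
    print(name, current_key(name))
```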

Expectation:

anabel
Arachno
de 'Hahn   ( <-- because "d" comes after "a")
Öztürk
Schneider
Zabel

I feel that replace is not the right way to solve the problem with the umlauts, but I can't find a better solution.

Update/Background information:

I am calling this from outside, from a subclass of "QSortFilterProxyModel"; I need it for sorting rows by the column that was clicked. I have a QTreeView which displays a result set from the database, and one column contains German family names. That's the background.

class HumanProxyModel(QtCore.QSortFilterProxyModel):
    def lessThan(self, source_left, source_right):
        data_left = source_left.data()
        data_right = source_right.data()
        if type(data_left) == type(data_right) == str:            
            return _human_key(data_left) < _human_key(data_right)            
        return super(HumanProxyModel, self).lessThan(source_left, source_right)

Solution

  • If you don't mind using third-party modules, you can use natsort (full disclosure, I am the author). For the data you give, it returns what you want out-of-the-box.

    >>> from natsort import natsorted, ns
    >>> data = ['Zabel', 'Schneider', 'anabel', 'Arachno', 'Öztürk', 'de Hahn']
    >>> natsorted(data, alg=ns.LOCALE)  # ns.LOCALE turns on locale-aware handling
    ['anabel', 'Arachno', 'de Hahn', 'Öztürk', 'Schneider', 'Zabel']
    >>> from natsort import humansorted
    >>> humansorted(data)  # shortcut for using LOCALE
    ['anabel', 'Arachno', 'de Hahn', 'Öztürk', 'Schneider', 'Zabel']
    

    If you need a sorting key, you can use natsort's key-generator:

    >>> from natsort import natsort_keygen, ns
    >>> humansort_key = natsort_keygen(alg=ns.LOCALE)
    >>> humansort_key(this) < humansort_key(that)
    

    Note, you don't necessarily need to use locale... you just need to properly normalize the Unicode, which natsort automatically does under the hood. In your case, it looks like you want both capital and lowercase letters grouped together with the lowercase first, so you could use this instead:

    >>> natsorted(data, alg=ns.GROUPLETTERS | ns.LOWERCASEFIRST)  # or ns.G | ns.LF
    ['anabel', 'Arachno', 'de Hahn', 'Öztürk', 'Schneider', 'Zabel']
    

    I suggest this because trying to deal with locale is a nightmare, and if it is not needed then you are much better off.
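
If adding a dependency is not an option, the normalization idea mentioned above can also be sketched with the standard library alone. This is a rough illustration, not what natsort does internally; the names fold and human_key are made up for this example. It assumes that stripping combining marks after NFKD decomposition is an acceptable way to drop umlauts, and it relies on str.casefold() (Python 3.3+) for case-insensitivity. Note that casefold() maps "ß" to "ss", not to "s" as in the question.

```python
import re
import unicodedata

_NUMBER = re.compile(r'(\d*\.\d+|\d+)')

def fold(text):
    """Strip diacritics and case, e.g. 'Öztürk' -> 'ozturk'."""
    # NFKD splits 'Ö' into 'O' plus a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # casefold() lowercases and also maps 'ß' to 'ss'.
    return stripped.casefold()

def human_key(text):
    parts = _NUMBER.split(fold(text))
    # Even indices are text, odd indices are the captured numbers.
    return tuple(float(p) if i % 2 else p for i, p in enumerate(parts))

names = ["Zabel", "Schneider", "anabel", "Arachno", "Öztürk", "de 'Hahn"]
print(sorted(names, key=human_key))
```

With this key, "d" and "D" compare as equal, so names that differ only in case or diacritics keep their original relative order (Python's sort is stable). In the question's lessThan(), human_key could stand in for _human_key unchanged.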