
Normalizing unicode text to filenames, etc. in Python


Are there any standalone-ish solutions for normalizing international Unicode text to safe ids and filenames in Python?

E.g. turn My International Text: åäö to my-international-text-aao

plone.i18n does a really good job, but unfortunately it depends on zope.security, zope.publisher and some other packages, which makes it a fragile dependency.

Some operations that plone.i18n applies


Solution

  • What you want to do is known as "slugifying" a string. Here's a possible solution (Python 2):

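    import re
    from unicodedata import normalize
    
    # Runs of whitespace and ASCII punctuation act as word separators.
    _punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')
    
    def slugify(text, delim=u'-'):
        """Generates a slightly worse ASCII-only slug."""
        result = []
        for word in _punct_re.split(text.lower()):
            # NFKD splits accented letters into base letter + combining mark;
            # encoding to ASCII with 'ignore' then drops the marks.
            word = normalize('NFKD', word).encode('ascii', 'ignore')
            if word:
                result.append(word)
        # Python 2 only: relies on the built-in unicode type.
        return unicode(delim.join(result))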
    

    Usage:

    >>> slugify(u'My International Text: åäö')
    u'my-international-text-aao'
    

    You can also change the delimiter:

    >>> slugify(u'My International Text: åäö', delim='_')
    u'my_international_text_aao'
    

    Source: Generating Slugs

    For Python 3: pastebin.com/ft7Yb3KS (thanks @MrPoxipol).
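
    The content of the pastebin link is not reproduced here. As a rough sketch only, the function above could be adapted to Python 3 along the following lines. The name slugify_py3 is made up for illustration, and the only real changes are decoding the ASCII bytes back to str and dropping the unicode() call:

```python
import re
from unicodedata import normalize

# Same separator set as the Python 2 version above.
_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')


def slugify_py3(text, delim='-'):
    """Hypothetical Python 3 adaptation: returns an ASCII-only slug as str."""
    result = []
    for word in _punct_re.split(text.lower()):
        # NFKD decomposes e.g. 'å' into 'a' + combining ring; encoding with
        # 'ignore' drops the non-ASCII combining mark, and decode() turns
        # the resulting bytes back into str.
        word = normalize('NFKD', word).encode('ascii', 'ignore').decode('ascii')
        if word:
            result.append(word)
    return delim.join(result)


print(slugify_py3('My International Text: åäö'))
print(slugify_py3('My International Text: åäö', delim='_'))
```

    Note that this approach only strips marks from characters that have a Unicode decomposition. Letters without one, such as 'ø' or 'ß', are dropped entirely rather than transliterated, so a transliteration table would be needed if those matter.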