python unicode plone normalization unicode-normalization

Normalizing unicode text to filenames, etc. in Python

Are there any standalonenish solutions for normalizing international unicode text to safe ids and filenames in Python?

E.g. turn My International Text: åäö to my-international-text-aao

plone.i18n does really good job, but unfortunately it depends on zope.security and zope.publisher and some other packages making it fragile dependency.

Some operations that plone.i18n applies

Solution

What you want to do is also known as "slugify" a string. Here's a possible solution:

import re
from unicodedata import normalize

_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')

def slugify(text, delim=u'-'):
    """Generates an slightly worse ASCII-only slug."""
    result = []
    for word in _punct_re.split(text.lower()):
        word = normalize('NFKD', word).encode('ascii', 'ignore')
        if word:
            result.append(word)
    return unicode(delim.join(result))

Usage:

>>> slugify(u'My International Text: åäö')
u'my-international-text-aao'

You can also change the delimeter:

>>> slugify(u'My International Text: åäö', delim='_')
u'my_international_text_aao'

Source: Generating Slugs

For Python 3: pastebin.com/ft7Yb3KS (thanks @MrPoxipol).