Search code examples
pythonfilenamesslugsanitize

Turn a string into a valid filename?


I have a string that I want to use as a filename, so I want to remove all characters that wouldn't be allowed in filenames, using Python.

I'd rather be strict than otherwise, so let's say I want to retain only letters, digits, and a small set of other characters like "_-.() ". What's the most elegant solution?

The filename needs to be valid on multiple operating systems (Windows, Linux and Mac OS) - it's an MP3 file in my library with the song title as the filename, and is shared and backed up between 3 machines.


Solution

  • You can look at the Django framework (but take their licence into account!) for how they create a "slug" from arbitrary text. A slug is URL- and filename- friendly.

    The Django text utils define a function, slugify(), that's probably the gold standard for this kind of thing. Essentially, their code is the following.

    import unicodedata
    import re
    
    def slugify(value, allow_unicode=False):
        """
        Taken from https://github.com/django/django/blob/master/django/utils/text.py
        Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
        dashes to single dashes. Remove characters that aren't alphanumerics,
        underscores, or hyphens. Convert to lowercase. Also strip leading and
        trailing whitespace, dashes, and underscores.
        """
        value = str(value)
        if allow_unicode:
            value = unicodedata.normalize('NFKC', value)
        else:
            value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
        value = re.sub(r'[^\w\s-]', '', value.lower())
        return re.sub(r'[-\s]+', '-', value).strip('-_')
    

    And the older version:

    def slugify(value):
        """
        Normalizes string, converts to lowercase, removes non-alpha characters,
        and converts spaces to hyphens.
        """
        import unicodedata
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
        value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
        value = unicode(re.sub('[-\s]+', '-', value))
        # ...
        return value
    

    There's more, but I left it out, since it doesn't address slugification, but escaping.