Tags: perl, seo, url-rewriting, cpan

How can I generate URL slugs in Perl?


Web frameworks such as Rails and Django have built-in support for "slugs", which are used to generate readable and SEO-friendly URLs:

A slug string typically contains only the characters a-z, 0-9, and -, and can hence be written in a URL without escaping (think "foo%20bar").
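
A quick illustration of that point (URI::Escape is used here only to demonstrate the escaping; it is not part of any slug function):

use URI::Escape qw(uri_escape);

print uri_escape("foo bar"), "\n";   # "foo%20bar" -- the space must be percent-encoded
print uri_escape("foo-bar"), "\n";   # "foo-bar"   -- a slug needs no escaping at all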

I'm looking for a Perl slug function that, given any valid Unicode string, will return a slug representation (a-z, 0-9, and -).

A super trivial slug function would be something along the lines of:

$input = lc($input);          # lowercase everything
$input =~ s/[^a-z0-9-]//g;    # drop anything that is not a-z, 0-9 or "-"

However, this implementation would not handle internationalization and accents (I want ë to become e). One way around this would be to enumerate all special cases, but that would not be very elegant. I'm looking for something more well thought out and general.
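
For example, a naive implementation along those lines simply drops accented letters instead of transliterating them (naive_slug is just an illustrative name for this sketch):

use utf8;   # the string literal below contains non-ASCII characters

sub naive_slug {
    my ($input) = @_;
    $input = lc($input);
    $input =~ s/[^a-z0-9-]//g;   # deletes spaces and accented letters alike
    return $input;
}

print naive_slug("Crème brûlée"), "\n";   # prints "crmebrle" -- è, û and é are simply gone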

My question:

  • What is the most general/practical way to generate Django/Rails type slugs in Perl? This is how I solved the same problem in Java.

Solution

  • The slugify filter currently used in Django translates (roughly) to the following Perl code:

    use Unicode::Normalize;
    
    sub slugify($) {
        my ($input) = @_;
    
        $input = NFKD($input);         # Normalize (decompose) the Unicode string
        $input =~ tr/\000-\177//cd;    # Strip non-ASCII characters (>127)
        $input =~ s/[^\w\s-]//g;       # Remove all characters that are not word characters (includes _), spaces, or hyphens
        $input =~ s/^\s+|\s+$//g;      # Trim whitespace from both ends
        $input = lc($input);
        $input =~ s/[-\s]+/-/g;        # Replace all occurrences of spaces and hyphens with a single hyphen
    
        return $input;
    }
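
    A quick usage sketch, with hypothetical inputs and assuming the slugify() sub above is in scope:

    print slugify("Hello, World!"), "\n";           # prints "hello-world"
    print slugify("  Perl -- URL slugs  "), "\n";   # prints "perl-url-slugs"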
    

    Since you also want to change accented characters to unaccented ones, throwing in a call to unidecode (defined in Text::Unidecode) before stripping the non-ASCII characters seems to be your best bet (as pointed out by phaylon).

    In that case, the function could look like:

    use Unicode::Normalize;
    use Text::Unidecode;
    
    sub slugify_unidecode($) {
        my ($input) = @_;
    
        $input = NFC($input);          # Normalize (recompose) the Unicode string
        $input = unidecode($input);    # Convert non-ASCII characters to closest equivalents
        $input =~ s/[^\w\s-]//g;       # Remove all characters that are not word characters (includes _), spaces, or hyphens
        $input =~ s/^\s+|\s+$//g;      # Trim whitespace from both ends
        $input = lc($input);
        $input =~ s/[-\s]+/-/g;        # Replace all occurrences of spaces and hyphens with a single hyphen
    
        return $input;
    }
    

    The former works well for strings that are primarily ASCII, but falls short when the entire string consists of non-ASCII characters, since they all get stripped out, leaving you with an empty string.

    Sample output:

    string        | slugify       | slugify_unidecode
    -------------------------------------------------
    hello world   | hello-world   | hello-world
    北亰          |               | bei-jing
    liberté       | liberte       | liberte
    

    Note how 北亰 gets slugified to nothing with the Django-inspired implementation. Note also the difference the normalization form makes: with NFKD, liberté becomes 'liberte', because the accent is split off as a combining character and then stripped, leaving the base letter behind; with NFC (and no unidecode), it would become 'libert', because the recomposed 'é' is stripped out whole.
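
    For reference, a small driver script along these lines (a sketch, assuming both subs above are in scope and the file is saved as UTF-8) reproduces the table:

    use utf8;                             # the string literals below contain non-ASCII characters
    binmode STDOUT, ':encoding(UTF-8)';   # so the input column prints cleanly

    for my $s ('hello world', '北亰', 'liberté') {
        printf "%-12s | %-12s | %s\n", $s, slugify($s), slugify_unidecode($s);
    }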