Search code examples
phpurlfilenamessanitization

Sanitizing strings to make them URL and filename safe?


I am trying to come up with a function that does a good job of sanitizing certain strings so that they are safe to use in the URL (like a post slug) and also safe to use as file names. For example, when someone uploads a file I want to make sure that I remove all dangerous characters from the name.

So far I have come up with the following function which I hope solves this problem and allows foreign UTF-8 data also.

/**
 * Convert a string to the file/URL safe "slug" form
 *
 * @param string $string the string to clean
 * @param bool $is_filename TRUE will allow additional filename characters
 * @return string
 */
function sanitize($string = '', $is_filename = FALSE)
{
 // Replace all weird characters with dashes
 $string = preg_replace('/[^\w\-'. ($is_filename ? '~_\.' : ''). ']+/u', '-', $string);

 // Only allow one dash separator at a time (and make string lowercase)
 return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
}

Does anyone have any tricky sample data I can run against this - or know of a better way to safeguard our apps from bad names?

$is-filename allows some additional characters like temp vim files

update: removed the star character since I could not think of a valid use


Solution

  • Some observations on your solution:

    1. 'u' at the end of your pattern means that the pattern, and not the text it's matching will be interpreted as UTF-8 (I presume you assumed the latter?).
    2. \w matches the underscore character. You specifically include it for files which leads to the assumption that you don't want them in URLs, but in the code you have URLs will be permitted to include an underscore.
    3. The inclusion of "foreign UTF-8" seems to be locale-dependent. It's not clear whether this is the locale of the server or client. From the PHP docs:

    A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

    Creating the slug

    You probably shouldn't include accented etc. characters in your post slug since, technically, they should be percent encoded (per URL encoding rules) so you'll have ugly looking URLs.

    So, if I were you, after lowercasing, I'd convert any 'special' characters to their equivalent (e.g. é -> e) and replace non [a-z] characters with '-', limiting to runs of a single '-' as you've done. There's an implementation of converting special characters here: https://web.archive.org/web/20130208144021/http://neo22s.com/slug

    Sanitization in general

    OWASP have a PHP implementation of their Enterprise Security API which among other things includes methods for safe encoding and decoding input and output in your application.

    The Encoder interface provides:

    canonicalize (string $input, [bool $strict = true])
    decodeFromBase64 (string $input)
    decodeFromURL (string $input)
    encodeForBase64 (string $input, [bool $wrap = false])
    encodeForCSS (string $input)
    encodeForHTML (string $input)
    encodeForHTMLAttribute (string $input)
    encodeForJavaScript (string $input)
    encodeForOS (Codec $codec, string $input)
    encodeForSQL (Codec $codec, string $input)
    encodeForURL (string $input)
    encodeForVBScript (string $input)
    encodeForXML (string $input)
    encodeForXMLAttribute (string $input)
    encodeForXPath (string $input)
    

    https://github.com/OWASP/PHP-ESAPI https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API