How can I sanitize a string while maintaining all non-Latin alphabet support

Generally, I would strip all characters that are not English using something like :

$file = filter_var($file, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH );

however, I am tired of not providing support for user input from other languages which may be in the form of an uploaded file (the filename may be in Cyrillic or Chinese, or Arabic, etc) or a form field, or even content from a WYSIWYG.

The examples for sanitizing data with regards to this, come in one of two forms

Those that strip all chars which are non-English
Those that convert all chars which are non-English to English letter substitutes.

The problem with this practice, is that you end up with a broken framework that pretends it supports multiple languages however it really doesn't aside from maybe displaying labels or content to them in their language.

There are a number of attacks which take advantage of unicode/utf-8/utf-16/etc support passing null bytes and so on, so it is clear that not sanitizing the data is not an option.

Is there any way to clean up a variable from arbitrary commands while maintaining the full alphabets/chars of these other languages, but stripping out (in a generic manner) all possible non-printable chars, chars that have nulls in them as part of the char, and other such exploits while maintaining the integrity of the actual characters the user input ? The above command is perfect and does everything exactly as it should, however it would be super cool if there were a way to expand that to allow support for all languages.

Solution

Null bytes are not(!) UTF-8, so assuming you use UTF-8 internally, all you need to do is to verify that the passed variables are UTF-8. There's no need to support UTF-16, for example, because you as author of the according API or form define the correct encoding and you can limit yourself to UTF-8. Further, "unicode" is also not an encoding you need to support, simply because it is not an encoding. Rather, Unicode is a standard and the UTF encodings are part of it.

Now, back to PHP, the function you are looking for is mb_check_encoding(). Error handling is simple, if any parameter doesn't pass that test, you reply with a "bad request" response. No need to try to guess what the user might have wanted.

While the question doesn't specifically ask this, here are some examples and how they should be handled on input:

non-UTF-8 bytes: Reject with 400 ("bad request").
strings containing path elements (like ../): Accept.
filename (not file path) containing path elements (like ../): Reject with 400.
filenames شعار.jpg, 标志.png or логотип.png: Accept.
filename foo <0> bar.jpg: Accept.
number abc: Reject with 400.
number 1234: Accept.

Here's how to handle them for different outputs:

non-UTF-8 bytes: Can't happen, they were rejected before.
filename containing path elements: Can't happen, they were rejected before.
filenames شعار.jpg, 标志.png or логотип.png in HTML: Use verbatim if the HTML encoding is UTF-8, replace as HTML entities when using default ISO8859-1.
filenames شعار.jpg, 标志.png or логотип.png in Bash: Use verbatim, assuming the filesystem's encoding is UTF-8.
filenames شعار.jpg, 标志.png or логотип.png in SQL: Probably just quote, depends on the driver, DB, tables etc. Consult the manual.
filename foo <0> bar.jpg in HTML: Escape as "foo <0> bar.jpeg". Maybe use " " for the spaces.
filename foo <0> bar.jpg in Bash: Quote or escape " ", "<" and ">" with backslashes.
filename foo <0> bar.jpg in SQL: Just quote.
number abc: Can't happen, they were rejected before.
number 1234 in HTML: Use verbatim.
number 1234 in Bash: Use verbatim (not sure).
number 1234 in SQL: Use verbatim.

The general procedure should be:

Define your internal types (string, filename, number) and reject anything that doesn't match. These types create constraints (filename doesn't include path elements) and offer guarantees (filename can be appended to a directory to form a filename inside that directory).
Use a template library (Moustache comes to mind) for HTML.
Use a DB wrapper library (PDO, Propel, Doctrine) for SQL.
Escape shell parameters. I'm not sure which way to go here, but I'm sure you will find proper ways.

Escaping is not a defined procedure but a family of procedures. The actual escaping algorithm used depends on the target context. Other than what you wrote ("escaping will also screw up the names"), the actual opposite should be the case! Basically, it makes sure that a string containing a less-than sign in XML remains a string containing a less-than sign and doesn't turn into a malformed XML snippet. In order to achieve that, escaping converts strings to prevent any character that is normally not interpreted as just text from getting its normal interpretation, like the space character in the shell.