javascript unicode uri encodeuricomponent

encodeURIComponent throws an exception

I am programmatically building a URI with the help of the encodeURIComponent function using user provided input. However, when the user enters invalid unicode characters (such as U+DFFF), the function throws an exception with the following message:

The URI to be encoded contains an invalid character

I looked this up on MSDN, but that didn't tell me anything I didn't already know.

To correct this error

Ensure the string to be encoded contains only valid Unicode sequences.

My question is, is there a way to sanitize the user provided input to remove all invalid Unicode sequences before I pass it on to the encodeURIComponent function?

Solution

Taking the programmatic approach to discover the answer, the only range that turned up any problems was \ud800-\udfff, the range for high and low surrogates:

for (var regex = '/[', firstI = null, lastI = null, i = 0; i <= 65535; i++) {
    try {
        encodeURIComponent(String.fromCharCode(i));
    }
    catch(e) {
        if (firstI !== null) {
            if (i === lastI + 1) {
                lastI++;
            }
            else if (firstI === lastI) {
                regex += '\\u' + firstI.toString(16);
                firstI = lastI = i; 
            }
            else {
                regex += '\\u' + firstI.toString(16) + '-' + '\\u' + lastI.toString(16);
                firstI = lastI = i; 
            }
        }
        else {
            firstI = i;
            lastI = i;
        }        
    }
}

if (firstI === lastI) {
    regex += '\\u' + firstI.toString(16);
}
else {
    regex += '\\u' + firstI.toString(16) + '-' + '\\u' + lastI.toString(16);
}
regex += ']/';
alert(regex);  // /[\ud800-\udfff]/

I then confirmed this with a simpler example:

for (var i = 0; i <= 65535 && (i <0xD800 || i >0xDFFF ) ; i++) {
    try {
        encodeURIComponent(String.fromCharCode(i));
    }
    catch(e) {
        alert(e); // Doesn't alert
    }
}
alert('ok!');

And this fits with what MSDN says because indeed all those Unicode characters (even valid Unicode "non-characters") besides surrogates are all valid Unicode sequences.

You can indeed filter out high and low surrogates, but when used in a high-low pair, they become legitimate (as they are meant to be used in this way to allow for Unicode to expand (drastically) beyond its original maximum number of characters):

alert(encodeURIComponent('\uD800\uDC00')); // ok
alert(encodeURIComponent('\uD800')); // not ok
alert(encodeURIComponent('\uDC00')); // not ok either

So, if you want to take the easy route and block surrogates, it is just a matter of:

urlPart = urlPart.replace(/[\ud800-\udfff]/g, '');

If you want to strip out unmatched (invalid) surrogates while allowing surrogate pairs (which are legitimate sequences but the characters are rarely ever needed), you can do the following:

function stripUnmatchedSurrogates (str) {
    return str.replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])/g, '').split('').reverse().join('').replace(/[\uDC00-\uDFFF](?![\uD800-\uDBFF])/g, '').split('').reverse().join('');
}

var urlPart = '\uD801 \uD801\uDC00 \uDC01'
alert(stripUnmatchedSurrogates(urlPart)); // Leaves one valid sequence (representing a single non-BMP character)

If JavaScript had negative lookbehind the function would be a lot less ugly...