
Illegal characters in URI


The java.net.URI ctor accepts most non-ASCII characters but does not accept ideographic space (0x3000). The ctor fails with java.net.URISyntaxException: Illegal character in path ...

So my questions are:

  • Why doesn't the URI ctor accept 0x3000 when it does accept other non-ASCII characters?
  • Which other characters does it reject?

Solution

  • Please note the 1st example contains the ideographic space rather than a regular space.

    It is the ideographic space that is the problem.

    Here is the code that allows non-ASCII characters to be used:

            } else if ((c > 128)
                       && !Character.isSpaceChar(c)
                       && !Character.isISOControl(c)) {
                // Allow unescaped but visible non-US-ASCII chars
                return p + 1;
            }
    

    As you can see, it disallows "funky" non-visible characters.
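
    To see the check in action, here is a minimal sketch. The class name, the parses helper and the example.com URLs are made up for illustration. It asks the two Character predicates about U+3000 and then tries the single-argument constructor with an accented letter and with the ideographic space:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class IdeographicSpaceDemo {
    // Hypothetical helper: true if the single-argument URI constructor
    // accepts the string, false if it throws URISyntaxException.
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        char ideographicSpace = '\u3000';
        char eAcute = '\u00E9';

        // U+3000 is a space character, so the check quoted above rejects it.
        System.out.println(Character.isSpaceChar(ideographicSpace));   // true
        System.out.println(Character.isISOControl(ideographicSpace));  // false

        // U+00E9 is neither a space nor a control character, so it passes.
        System.out.println(Character.isSpaceChar(eAcute));             // false

        // Non-ASCII letter in the path: accepted.
        System.out.println(parses("http://example.com/caf\u00E9"));
        // Ideographic space in the path: rejected.
        System.out.println(parses("http://example.com/a\u3000b"));
    }
}
```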

    See also the URI class javadocs, which specify which characters are allowed (by the class!) in each component of a URI.

    Why?

    It is probably a safety measure.

    What others are disallowed?

    Any non-ASCII character that is a space character or a control character ... according to the respective Character predicate methods, isSpaceChar and isISOControl. (See the Character javadocs for a precise specification.)
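
    If you want an actual list, a rough sketch like the following enumerates the candidates. It does not call into java.net.URI at all. It only re-applies the two predicates from the excerpt above to every char from U+0080 to U+FFFF, so treat it as an approximation of what the parser rejects. The class and method names are made up, and the exact result depends on the Unicode version your JDK ships with.

```java
import java.util.ArrayList;
import java.util.List;

public class DisallowedNonAscii {
    // Hypothetical helper: collects every non-ASCII BMP char that is a
    // space character or an ISO control character, i.e. the chars that
    // fail the condition in the quoted excerpt.
    static List<String> rejected() {
        List<String> result = new ArrayList<>();
        for (int c = 0x80; c <= 0xFFFF; c++) {
            char ch = (char) c;
            if (Character.isSpaceChar(ch) || Character.isISOControl(ch)) {
                result.add(String.format("U+%04X", c));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the code points one per line; no count is assumed here
        // because it varies with the JDK's Unicode tables.
        for (String codePoint : rejected()) {
            System.out.println(codePoint);
        }
    }
}
```

    On any recent JDK the output should include the C1 control range starting at U+0080, the no-break space U+00A0, and U+3000 itself.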

    You should also note that this is a deviation from the URI specification. The URI specification says that non-ASCII characters are only allowed if you:

    • convert the UCS character code to UTF-8, and
    • percent encode the UTF-8 bytes as required by the spec.

    My understanding is that the URI.toASCIIString() method will take care of that if you have a "deviant" java.net.URI object.
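
    A small sketch of that, with made-up class and method names and example.com as a placeholder host. The first helper parses a string containing an unescaped accented letter and asks for the ASCII form. The second one relies on my reading of the javadocs that the multi-argument constructors quote illegal characters in a component, which would give you a way to get U+3000 into a URI without escaping it by hand:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class AsciiFormDemo {
    // Parses a string that may contain unescaped non-ASCII characters and
    // returns the percent-encoded, ASCII-only form.
    static String toAscii(String uri) {
        return URI.create(uri).toASCIIString();
    }

    // Builds a URI from components (scheme, host, path, fragment) and
    // returns its ASCII form. The host is a placeholder.
    static String fromPath(String path) {
        try {
            return new URI("http", "example.com", path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // U+00E9 becomes its UTF-8 bytes C3 A9, percent-encoded.
        System.out.println(toAscii("http://example.com/caf\u00E9"));
        // prints http://example.com/caf%C3%A9

        // U+3000 is passed as part of the path component rather than as
        // part of an already assembled URI string.
        System.out.println(fromPath("/a\u3000b"));
    }
}
```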