URI constructor bug?

I'm wondering whether I'm overlooking something or whether this is a bug in the java.net.URI class. Consider the following code:

        URI uri = new URI(
            null,
            null,
            null,
            -1,
            "test:this.pdf",
            null,
            null
        );
        System.out.println("Scheme: " + uri.getScheme());
        System.out.println("Path: " + uri.getPath());
        System.out.println("URI: " + uri);

I am passing in only a path, yet the resulting URI reports a scheme and no path:

Scheme: test
Path: null
URI: test:this.pdf

If we add a space, we get an exception:

URI uri = new URI(
    null,
    null,
    null,
    -1,
    "test :this.pdf",
    null,
    null
);

This throws a URISyntaxException with the message: Illegal character in scheme name at index 4: test%20:this.pdf

If however you add a strategically placed slash, it works again:

URI uri = new URI(
    null,
    null,
    null,
    -1,
    "test /:this.pdf",
    null,
    null
);

If you try to explicitly set a scheme, it fails for another reason:

URI uri = new URI(
    "example",
    null,
    null,
    -1,
    "test :this.pdf",
    null,
    null
);

This throws a URISyntaxException with the message: Relative path in absolute URI: example:test%20:this.pdf

It cannot be fixed by explicitly encoding the colon:

URI uri = new URI(
    null,
    null,
    null,
    -1,
    "test :this.pdf".replaceAll(":", "%3A"),
    null,
    null
);

The constructor escapes the percent sign again, so the encoded sequence becomes part of the path:

Scheme: null
Path: test %3Athis.pdf
URI: test%20%253Athis.pdf

Currently my only option seems to be to switch to the single-argument constructor new URI(String), detect the edge case myself, and encode it:

uri = new URI("test :this.pdf".replaceAll(":", "%3A").replaceAll(" ", "%20"));

This outputs what I need:

Scheme: null
Path: test :this.pdf
URI: test%20%3Athis.pdf

The specification (RFC 3986, Section 3.3) states:

In addition, a URI reference (Section 4.1) may be a relative-path reference, in which case the first path segment cannot contain a colon (":") character.

It does not specify further whether an encoded variant is allowed, but clearly, when getting the path from a URI in Java, the encoded variant is allowed at that point.

In that light, it seems the original constructor is flawed in not encoding the ":".

Solution

You have pretty much answered your own question by quoting the RFC: the first path segment may not contain a colon. Therefore, "test:this.pdf" cannot possibly be a valid path.

Thus, the URI class assumes you wanted a non-hierarchical URI, also known as an opaque URI. You can see this if you add this line:

System.out.println("Opaque: " + uri.isOpaque());
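To make that visible, here is a minimal sketch that wraps the constructor call from the question and prints the components of the result. The class name OpaqueCheck and the helper fromPathOnly are made up for this illustration; only the java.net.URI calls are real API.

```java
import java.net.URI;
import java.net.URISyntaxException;

class OpaqueCheck {
    // Hypothetical helper: builds a URI from a path only, using the
    // seven-argument constructor exactly as in the question.
    static URI fromPathOnly(String path) {
        try {
            return new URI(null, null, null, -1, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = fromPathOnly("test:this.pdf");
        // The string "test:this.pdf" is re-parsed, so "test" becomes the
        // scheme and "this.pdf" the scheme-specific part of an opaque URI.
        System.out.println("Opaque: " + uri.isOpaque());
        System.out.println("Scheme: " + uri.getScheme());
        System.out.println("Scheme-specific part: " + uri.getSchemeSpecificPart());
        System.out.println("Path: " + uri.getPath());
    }
}
```

An opaque URI has no path component, which is why getPath() returns null in the question's output.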

I feel this is, in fact, a bug in java.net.URI. The seven-argument constructor is documented as generating a hierarchical URI. I would prefer that the constructor either percent-escapes the colon or throws a URISyntaxException. However, some methods in that class have language that implies re-parsing happens internally and can change whether a resulting URI is opaque or not.

"test /:this.pdf" does not have a colon in the first path segment, which is why it works.
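If you want to detect the problematic case up front, the rule can be sketched as a small check on the first segment. FirstSegment and firstSegmentHasColon are names I made up for this illustration, and the check looks only at the raw path string, nothing more.

```java
class FirstSegment {
    // Hypothetical helper: returns true if the first path segment
    // (everything before the first '/') contains a literal colon.
    static boolean firstSegmentHasColon(String path) {
        int slash = path.indexOf('/');
        String first = (slash < 0) ? path : path.substring(0, slash);
        return first.indexOf(':') >= 0;
    }

    public static void main(String[] args) {
        // The colon is in the first segment: not a valid relative path.
        System.out.println(firstSegmentHasColon("test:this.pdf"));
        System.out.println(firstSegmentHasColon("test :this.pdf"));
        // The first segment is "test "; the colon is in the second segment.
        System.out.println(firstSegmentHasColon("test /:this.pdf"));
    }
}
```

A path that begins with a slash has an empty first segment under this check, so it never triggers.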

The example with an explicit scheme is easier to explain:

URI uri = new URI(
    "example",
    null,
    null,
    -1,
    "test :this.pdf",
    null,
    null
);

The documentation for that constructor states:

If a scheme is given then the path, if also given, must either be empty or begin with a slash character ('/').
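A quick way to see that rule in action is to try the same path with and without a leading slash. This is only a sketch; SchemeWithPath and build are hypothetical names, and the helper simply returns the exception's reason text instead of rethrowing.

```java
import java.net.URI;
import java.net.URISyntaxException;

class SchemeWithPath {
    // Hypothetical helper: builds scheme + path with the seven-argument
    // constructor and reports the failure reason instead of throwing.
    static String build(String scheme, String path) {
        try {
            return new URI(scheme, null, null, -1, path, null, null).toString();
        } catch (URISyntaxException e) {
            return "error: " + e.getReason();
        }
    }

    public static void main(String[] args) {
        // Relative path together with a scheme: rejected.
        System.out.println(build("example", "test :this.pdf"));
        // Same path with a leading slash: accepted, and the space is quoted.
        System.out.println(build("example", "/test :this.pdf"));
    }
}
```

With the leading slash the colon is no longer a problem, because the scheme has already been delimited and a colon is legal inside an absolute path.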

As for doing your own percent-escaping: do not do replaceAll(":", "%3A"). The URI class already performs percent-escaping. If you try to do it yourself, you’ll only create double-escaping. However… the single-argument constructor is different and actually requires URI-unsafe characters to be percent-escaped beforehand. All of this is documented in the class documentation:

• The single-argument constructor requires any illegal characters in its argument to be quoted and preserves any escaped octets and other characters that are present.
• The multi-argument constructors quote illegal characters as required by the components in which they appear. The percent character ('%') is always quoted by these constructors. Any other characters are preserved.
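The difference between the two kinds of constructor can be sketched side by side, using the strings from the question. QuotingDemo, multiArg and singleArg are names invented for this illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

class QuotingDemo {
    // Multi-argument constructor: quotes illegal characters and always
    // quotes '%', so an already escaped "%3A" turns into "%253A".
    static URI multiArg(String path) {
        try {
            return new URI(null, null, null, -1, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Single-argument constructor: expects the caller to have escaped
    // everything already and preserves existing escapes.
    static URI singleArg(String escaped) {
        try {
            return new URI(escaped);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI a = multiArg("test %3Athis.pdf");
        System.out.println(a + " -> " + a.getPath());
        URI b = singleArg("test%20%3Athis.pdf");
        System.out.println(b + " -> " + b.getPath());
    }
}
```

Only the second variant decodes back to the path "test :this.pdf"; the first one keeps the literal text "%3A" in the decoded path.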

As for this:

new URI("test :this.pdf".replaceAll(":", "%3A").replaceAll(" ", "%20"))

Well, okay, you can do that. (Note: replaceAll treats its first argument as a regular expression, which is needlessly expensive here. Use replace(":", "%3A"), not replaceAll.) Replacing all unsafe characters is a bit more involved; see RFC 3986 for all the details. For a single path segment, it would be something like:

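// Needs java.net.URI, java.nio.charset.StandardCharsets and java.util.Formatter.
Formatter uriStr = new Formatter();
for (byte b : "test :this.pdf".getBytes(StandardCharsets.UTF_8)) {
    // Escape control characters, space, DEL, non-ASCII bytes (negative as
    // signed bytes) and delimiters, including '?', which would start a query.
    boolean unsafe = (b <= 32 || b >= 127 ||
        "\"#%/:<>?[]\\`^{}|".indexOf(b) >= 0);
    // "%%" emits a literal percent sign; "%02X" emits two uppercase hex digits.
    uriStr.format(unsafe ? "%%%02X" : "%c", b);
}
URI uri = new URI(uriStr.toString());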

(If the URI has a query, it gets a little more complicated, since & and = also need to be escaped, but only in names and values, not in the entire URI.)

Regarding the RFC, you mentioned:

It does not specify further whether an encoded variant is allowed, but clearly, when getting the path from a URI in Java, the encoded variant is allowed at that point.

By definition, a percent-escape is not a colon; it is an escape sequence, so it is allowed.

In fact, the specification does define this:

path          = path-abempty    ; begins with "/" or is empty
              / path-absolute   ; begins with "/" but not "//"
              / path-noscheme   ; begins with a non-colon segment
              / path-rootless   ; begins with a segment
              / path-empty      ; zero characters

path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )

segment       = *pchar
segment-nz    = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
              ; non-zero-length segment without any colon ":"

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

(Bold added by me.)

segment-nz-nc can contain any percent-escapes, but cannot contain a literal colon character.
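As an aside, RFC 3986 (Section 4.2) also says that a first segment containing a colon must be preceded by a dot-segment, as in "./this:that", to form a relative-path reference. Assuming a leading "./" is acceptable in your application, that gives a way to keep using the seven-argument constructor. This is only a sketch; DotSegmentPrefix and relativePath are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;

class DotSegmentPrefix {
    // Hypothetical helper: prepends "./" when the first segment of a
    // relative path contains a colon, then lets the seven-argument
    // constructor do the percent-escaping.
    static URI relativePath(String path) {
        int slash = path.indexOf('/');
        String first = (slash < 0) ? path : path.substring(0, slash);
        String safe = first.indexOf(':') >= 0 ? "./" + path : path;
        try {
            return new URI(null, null, null, -1, safe, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = relativePath("test :this.pdf");
        System.out.println("Scheme: " + uri.getScheme());
        System.out.println("Path: " + uri.getPath());
        System.out.println("URI: " + uri);
    }
}
```

The trade-off is that getPath() now returns "./test :this.pdf" rather than the bare name, so whoever consumes the path has to tolerate the dot-segment.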