
Resolve URI with multiple slashes in relative part


I have to write a Perl script that parses URIs out of HTML. The real problem, though, is how to resolve relative URIs.

I have a base URI (the base href in the HTML), for example http://a/b/c/d;p?q (to follow the examples in RFC 3986), and various relative URIs:

/g, //g, ///g, ////g, h//g, g////h, h///g:f

Section 5.4.1 of this RFC only gives an example for //g:

"//g" = "http://g"

What about all the other cases? As far as I understand from RFC 3986, section 3.3, multiple slashes are allowed. So, is the following resolution correct?

"///g" = "http://a/b/c///g"

Or what should it be? Can anyone explain this and back it up with a non-obsolete RFC or other documentation?

Update #1: Take a look at this working URL: https:///stackoverflow.com////////a/////10161264/////6618577

What's going on here?


Solution

  • I'll start by confirming that all the URIs you provided are valid, and by providing the outcome of the URI resolutions you mentioned (and the outcome of a couple of my own):

    $ perl -MURI -e'
       for my $rel (qw( /g //g ///g ////g h//g g////h h///g:f )) {
          my $uri = URI->new($rel)->abs("http://a/b/c/d;p?q");
          printf "%-20s + %-7s = %-20s   host: %-4s   path: %s\n",
             "http://a/b/c/d;p?q", $rel, $uri, $uri->host, $uri->path;
       }
    
       for my $base (qw( http://host/a/b/c/d http://host/a/b/c//d )) {
          my $uri = URI->new("../../e")->abs($base);
          printf "%-20s + %-7s = %-20s   host: %-4s   path: %s\n",
             $base, "../../e", $uri, $uri->host, $uri->path;
       }
    '
    http://a/b/c/d;p?q   + /g      = http://a/g             host: a      path: /g
    http://a/b/c/d;p?q   + //g     = http://g               host: g      path:
    http://a/b/c/d;p?q   + ///g    = http:///g              host:        path: /g
    http://a/b/c/d;p?q   + ////g   = http:////g             host:        path: //g
    http://a/b/c/d;p?q   + h//g    = http://a/b/c/h//g      host: a      path: /b/c/h//g
    http://a/b/c/d;p?q   + g////h  = http://a/b/c/g////h    host: a      path: /b/c/g////h
    http://a/b/c/d;p?q   + h///g:f = http://a/b/c/h///g:f   host: a      path: /b/c/h///g:f
    http://host/a/b/c/d  + ../../e = http://host/a/e        host: host   path: /a/e
    http://host/a/b/c//d + ../../e = http://host/a/b/e      host: host   path: /a/b/e
    

    Next, we'll look at the syntax of relative URIs, since that's what your question circles around.

    relative-ref  = relative-part [ "?" query ] [ "#" fragment ]
    
    relative-part = "//" authority path-abempty
                  / path-absolute
                  / path-noscheme
                  / path-empty
    
    path-abempty  = *( "/" segment )
    path-absolute = "/" [ segment-nz *( "/" segment ) ]
    path-noscheme = segment-nz-nc *( "/" segment )
    path-rootless = segment-nz *( "/" segment )
    
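    segment       = *pchar         ; 0 or more <pchar>
    segment-nz    = 1*pchar        ; 1 or more <pchar>   nz = non-zero
    segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                                   ; 1 or more <pchar> other than ":"   nc = no colon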
    

    The key things from these rules for answering your question:

    • An absolute path (path-absolute) can't start with //. The first segment, if provided, must be non-zero in length. If the relative URI starts with //, what follows must be an authority.
    • // can otherwise occur anywhere else in a path, because segments can be zero-length.
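
    To make those two points concrete, here is a minimal sketch that reports which alternative of relative-part a reference matches. It assumes the input is already known to have no scheme and doesn't validate characters; classify_relative_part is a name I made up for illustration, not something from the URI module.

```perl
use strict;
use warnings;

# Report which alternative of relative-part a reference matches.
# Assumption: $ref has no scheme; its characters are not validated.
sub classify_relative_part {
    my ($ref) = @_;
    (my $part = $ref) =~ s/[?#].*//s;            # ignore query and fragment
    return 'authority'     if $part =~ m{^//};   # "//" authority path-abempty
    return 'path-absolute' if $part =~ m{^/};    # exactly one leading "/"
    return 'path-empty'    if $part eq '';
    return 'path-noscheme';                      # relative path
}

for my $rel (qw( /g //g ///g ////g h//g g////h h///g:f )) {
    printf "%-8s %s\n", $rel, classify_relative_part($rel);
}
```

    Anything starting with two slashes lands in the authority branch, no matter how many more slashes follow.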

    Now, let's look at each of the resolutions you provided in turn.

    /g is an absolute path (path-absolute), and thus a valid relative reference (relative-ref), and thus a valid URI reference (URI-reference).

    • Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:

      Base.scheme:    "http"       R.scheme:    undef
      Base.authority: "a"          R.authority: undef
      Base.path:      "/b/c/d;p"   R.path:      "/g"
      Base.query:     "q"          R.query:     undef
      Base.fragment:  undef        R.fragment:  undef
      
    • Following the algorithm in §5.2.2, we get:

      T.path:         "/g"      ; remove_dot_segments(R.path)
      T.query:        undef     ; R.query
      T.authority:    "a"       ; Base.authority
      T.scheme:       "http"    ; Base.scheme
      T.fragment:     undef     ; R.fragment
      
    • Following the algorithm in §5.3, we get:

      http://a/g
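
    The parsing step can be sketched in a few lines of Perl. This uses the Appendix B expression, with non-capturing groups for the parts that aren't needed; parse_uri_reference is an illustrative name, not an existing API. The thing to watch is the difference between a component that is absent (undef) and one that is present but empty (""), which matters for the cases that follow.

```perl
use strict;
use warnings;

# Split a URI reference into its five components using the regular
# expression from RFC 3986 Appendix B. Absent components come back as
# undef; components that are present but empty come back as "".
# parse_uri_reference is a hypothetical helper name.
sub parse_uri_reference {
    my ($ref) = @_;
    my %c;
    @c{qw( scheme authority path query fragment )} =
        $ref =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?}s;
    return \%c;
}

for my $rel (qw( /g //g ///g ////g h//g )) {
    my $c = parse_uri_reference($rel);
    printf "%-6s authority: %-6s path: %s\n", $rel,
        defined($c->{authority}) ? qq{"$c->{authority}"} : 'undef',
        qq{"$c->{path}"};
}
```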
      

    //g is different. It isn't an absolute path (path-absolute), because an absolute path can't start with an empty segment ("/" [ segment-nz *( "/" segment ) ]).

    Instead, it follows this pattern:

    "//" authority path-abempty
    
    • Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:

      Base.scheme:    "http"       R.scheme:    undef
      Base.authority: "a"          R.authority: "g"
      Base.path:      "/b/c/d;p"   R.path:      ""
      Base.query:     "q"          R.query:     undef
      Base.fragment:  undef        R.fragment:  undef
      
    • Following the algorithm in §5.2.2, we get the following:

      T.authority:    "g"           ; R.authority
      T.path:         ""            ; remove_dot_segments(R.path)
      T.query:        undef         ; R.query
      T.scheme:       "http"        ; Base.scheme
      T.fragment:     undef         ; R.fragment
      
    • Following the algorithm in §5.3, we get the following:

      http://g
      

    Note: This contacts server g!


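    ///g is similar to //g, except the authority is empty! Surprisingly, this is valid: the host part of an authority may be a reg-name, and the grammar allows a reg-name to be zero-length.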

    • Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:

      Base.scheme:    "http"       R.scheme:    undef
      Base.authority: "a"          R.authority: ""
      Base.path:      "/b/c/d;p"   R.path:      "/g"
      Base.query:     "q"          R.query:     undef
      Base.fragment:  undef        R.fragment:  undef
      
    • Following the algorithm in §5.2.2, we get the following:

      T.authority:    ""        ; R.authority
      T.path:         "/g"      ; remove_dot_segments(R.path)
      T.query:        undef     ; R.query
      T.scheme:       "http"    ; Base.scheme
      T.fragment:     undef     ; R.fragment
      
    • Following the algorithm in §5.3, we get the following:

      http:///g
      

    Note: While valid, this URI is useless because the server name (T.authority) is blank!


    ////g is the same as ///g, except that R.path is //g, so we get:

        http:////g
    

    Note: While valid, this URI is useless because the server name (T.authority) is blank!


    The final three (h//g, g////h, h///g:f) are all relative paths (path-noscheme).

    • Parsing the URIs (say, using the regular expression in Appendix B) gives us the following (shown for h//g; the other two differ only in R.path):

      Base.scheme:    "http"       R.scheme:    undef
      Base.authority: "a"          R.authority: undef
      Base.path:      "/b/c/d;p"   R.path:      "h//g"
      Base.query:     "q"          R.query:     undef
      Base.fragment:  undef        R.fragment:  undef
      
    • Following the algorithm in §5.2.2, we get the following:

      T.path:         "/b/c/h//g"    ; remove_dot_segments(merge(Base.path, R.path))
      T.query:        undef          ; R.query
      T.authority:    "a"            ; Base.authority
      T.scheme:       "http"         ; Base.scheme
      T.fragment:     undef          ; R.fragment
      
    • Following the algorithm in §5.3, we get the following:

      http://a/b/c/h//g         # For h//g
      http://a/b/c/g////h       # For g////h
      http://a/b/c/h///g:f      # For h///g:f
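
    For the relative-path case, the interesting work is in merge (§5.2.3) and remove_dot_segments (§5.2.4). Below is a rough sketch of those two steps as I read the RFC. The sub names are mine, and merge_paths assumes the base has a non-empty path, so it skips the special case for a base with an authority and an empty path.

```perl
use strict;
use warnings;

# RFC 3986 section 5.2.3: keep the base path up to and including its last
# "/", then append the reference path.
# Assumption: the base path is non-empty.
sub merge_paths {
    my ($base_path, $ref_path) = @_;
    (my $dir = $base_path) =~ s{[^/]*\z}{};
    return $dir . $ref_path;
}

# RFC 3986 section 5.2.4: interpret and remove "." and ".." segments.
# Empty segments are moved to the output like any other segment.
sub remove_dot_segments {
    my ($in) = @_;
    my $out = '';
    while (length $in) {
        if    ($in =~ s{^\.\.?/}{})         { }   # A: leading "../" or "./"
        elsif ($in =~ s{^/\.(?:/|\z)}{/})   { }   # B: "/./" or a final "/."
        elsif ($in =~ s{^/\.\.(?:/|\z)}{/}) {     # C: "/../" or a final "/.."
            $out =~ s{/?[^/]*\z}{};               #    drop the last output segment
        }
        elsif ($in =~ m{^\.\.?\z})          { $in = '' }   # D: lone "." or ".."
        else {                                    # E: move one segment to the output
            $in =~ s{^(/?[^/]*)}{};
            $out .= $1;
        }
    }
    return $out;
}

for my $rel (qw( h//g g////h h///g:f )) {
    print "http://a", remove_dot_segments(merge_paths('/b/c/d;p', $rel)), "\n";
}
```

    Note that nothing in either step collapses runs of slashes; an empty segment is only ever removed when a ".." consumes it.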
      

    I don't think these examples really answer what I suspect you want to know, though.

    Take a look at the following two URIs. They aren't equivalent.

    http://host/a/b/c/d     # Path has 4 segments: "a", "b", "c", "d"
    

    and

    http://host/a/b/c//d    # Path has 5 segments: "a", "b", "c", "", "d"
    

    Most servers will treat them the same (which is fine, since servers are free to interpret paths in any way they wish), but it makes a difference when applying relative paths. For example, if these were the base URIs for ../../e, you'd get

    http://host/a/b/c/d + ../../e = http://host/a/e
    

    and

    http://host/a/b/c//d + ../../e = http://host/a/b/e
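
    To see where the difference comes from, it helps to count segments. The following is a simplified illustration rather than the full §5.2 algorithm: it assumes an absolute base path with at least three segments, and path_segments is a hypothetical helper. Resolving ../../e amounts to dropping the base's last segment (the merge step), dropping one more segment per "..", and appending "e".

```perl
use strict;
use warnings;

# Split an absolute path into its segments, keeping empty ones. The -1
# limit stops split from discarding trailing empty fields.
# path_segments is a hypothetical helper name.
sub path_segments {
    my ($path) = @_;
    my @segments = split m{/}, $path, -1;
    shift @segments;    # drop the empty field before the leading "/"
    return @segments;
}

for my $path (qw( /a/b/c/d /a/b/c//d )) {
    my @segments = path_segments($path);
    printf "%-10s has %d segments: %s\n", $path, scalar(@segments),
        join(', ', map { qq{"$_"} } @segments);

    # One pop for the merge step, one for each of the two ".." segments.
    pop @segments for 1 .. 3;
    print "  ../../e resolves to /", join('/', @segments, 'e'), "\n";
}
```

    The empty segment in the second base counts as one of the three that get removed, which is why "b" survives there but not in the first.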