
Where is the HOST header taken from when relative requests are made?


According to the SPEC :

The most common form of Request-URI is that used to identify a resource on an origin server or gateway. In this case the absolute path of the URI MUST be transmitted as the Request-URI, and the network location of the URI (authority) MUST be transmitted in a Host header field.
For example, a client wishing to retrieve the resource above directly from the origin server would create a TCP connection to port 80 of the host "www.w3.org" and send the lines:

   GET /pub/WWW/TheProject.html HTTP/1.1
   Host: www.w3.org

So when a user makes this request, they get a response.

OK, now that this cycle is over(!), a user clicks on a <a href="/help">Help</a> link.

  • Notice that the address is relative.

  • Also notice that the form does not(!) have an action containing the base URL ("www.w3.org/help").

Example (look in the iframe's view source) :

http://i.imgur.com/uChSE6Q.png

  • Also notice that there is no <base> tag specifying the base URL.

OK, so what is the question?

Question

If a user clicks on the hyperlink, how does the browser know which host to go to? AFAIK it is not taken from the address bar URL.

I know that document.location contains all the information, but still, I don't think JS is involved here.

Given that the previous cycle (the first request) is over, where is the Host header value taken from when relative requests are now made?

A SPEC reference would be much appreciated.

Edit:

I've been investigating it a bit with Fiddler:

So for this HTML:

<body>
  <a href="/GetSomething"> Click me</a>
</body>

Fiddler shows this result:

GET http://null.jsbin.com/GetSomething HTTP/1.1
Host: null.jsbin.com
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36
DNT: 1
Referer: http://null.jsbin.com/runner
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8,he;q=0.6
Cookie: _ga=GA1.2.474312101.1437654587; _gat=1; jsbin=s%3A...

The URL is the full URL (obviously):

http://null.jsbin.com/GetSomething

It might be a very simple question to ask, but still: where does it take the base URL from? (JavaScript (document.location)? The address bar URL? Some internal storage inside the browser?)


Solution

  • http://www.w3.org/TR/html401/struct/links.html#edef-BASE

    " 12.4.1 Resolving relative URIs

    User agents must calculate the base URI for resolving relative URIs according to [RFC1808], section 3. The following describes how [RFC1808] applies specifically to HTML.

    User agents must calculate the base URI according to the following precedences (highest priority to lowest):

    1. The base URI is set by the BASE element.
    2. The base URI is given by meta data discovered during a protocol interaction, such as an HTTP header (see [RFC2616]).
    3. By default, the base URI is that of the current document. Not all HTML documents have a base URI (e.g., a valid HTML document may appear in an email and may not be designated by a URI). Such HTML documents are considered erroneous if they contain relative URIs and rely on a default base URI.

    "
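The precedence quoted above can be sketched in a few lines. This is only an illustration, not how any particular browser implements it: `effective_base`, `resolve`, and the `base_href` / `header_base` parameters are hypothetical names, and Python's `urllib.parse.urljoin` stands in for the browser's own resolver.

```python
from typing import Optional
from urllib.parse import urljoin


def effective_base(document_url: str,
                   base_href: Optional[str] = None,
                   header_base: Optional[str] = None) -> str:
    """Pick the base URI using the precedence quoted above (hypothetical helper)."""
    if base_href:        # 1. a BASE element in the document
        return urljoin(document_url, base_href)
    if header_base:      # 2. base supplied by protocol metadata (e.g. an HTTP header)
        return urljoin(document_url, header_base)
    return document_url  # 3. default: the URI of the current document


def resolve(href: str,
            document_url: str,
            base_href: Optional[str] = None,
            header_base: Optional[str] = None) -> str:
    """Resolve a link's href against the effective base URI."""
    return urljoin(effective_base(document_url, base_href, header_base), href)


# The page from the question has no BASE element, so rule 3 applies:
print(resolve("/GetSomething", "http://null.jsbin.com/runner"))
# http://null.jsbin.com/GetSomething
```

With no BASE element and no header-supplied base, the host comes from the URL the browser used to fetch the current document, which matches the Referer seen in the Fiddler trace.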

    www.anotherebsite.com/action_page.php, used as the href of an <a> element, has no scheme, so the browser treats it as a relative link. On click, the browser does not change the host it sends the request to; it resolves the link against the base URI of the current document.

    But http://www.anotherebsite.com/action_page.php is an absolute (remote) address to the browser, so on click it updates the host and navigates the client to the remote address.
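The difference between the two hrefs can be checked with a quick experiment. This is a sketch that assumes the current document is the http://null.jsbin.com/runner page from the question; `DOCUMENT_URL` and `target_host` are names made up for the illustration, and `urljoin` only approximates what a browser does.

```python
from urllib.parse import urljoin, urlsplit

# Assumed URL of the document that contains the links.
DOCUMENT_URL = "http://null.jsbin.com/runner"


def target_host(href: str, document_url: str = DOCUMENT_URL) -> str:
    """Return the host the request would go to after resolving href."""
    return urlsplit(urljoin(document_url, href)).netloc


no_scheme = "www.anotherebsite.com/action_page.php"
with_scheme = "http://www.anotherebsite.com/action_page.php"

# Without a scheme, the whole href is treated as a relative path.
print(urljoin(DOCUMENT_URL, no_scheme))
# http://null.jsbin.com/www.anotherebsite.com/action_page.php
print(target_host(no_scheme))    # null.jsbin.com
print(target_host(with_scheme))  # www.anotherebsite.com
```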

    The user agent is the browser or application that connects to the server and sends requests and receives responses on the client side. Examples of user agents are browsers such as Firefox, IE, and Chrome. Each time a user agent wants to display a specific page, it sends a request to the server asking for that content, and the server responds as it sees fit (normally with the requested content). The server then sends the requested data as text to the browser's IP address, and routers carry it to the client. On receiving the text, the browser renders the HTML markup into the user interface.

    Browsers are designed with local storage and caches that store the links you visit and other data such as form data, window content, passwords, and history.

    GET http://null.jsbin.com/GetSomething HTTP/1.1
    Host: null.jsbin.com
    

    Here the browser asks the host (null.jsbin.com) to send GetSomething: the request is sent to "null.jsbin.com" and asks for "GetSomething".
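Once the link has been resolved to an absolute URL, the request line and the Host header can both be read off that URL. The sketch below shows the idea; `request_head` and its `via_proxy` flag are hypothetical, and a real browser sends many more headers. The absolute URL in the request line of the Fiddler trace is what a client sends when it talks to a proxy, which is the role Fiddler plays; when talking directly to the origin server only the path is sent.

```python
from urllib.parse import urlsplit


def request_head(absolute_url: str, via_proxy: bool = False) -> str:
    """Build the request line and Host header for an already-resolved URL."""
    parts = urlsplit(absolute_url)
    # Direct to the origin server: only the path (plus query) is sent.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # Through a proxy: the full URL goes in the request line.
    if via_proxy:
        target = absolute_url
    # Either way, the Host header carries the authority part of the URL.
    return f"GET {target} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"


print(request_head("http://null.jsbin.com/GetSomething"))
# GET /GetSomething HTTP/1.1
# Host: null.jsbin.com
```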

    The server identifies the browser through the following:

    • Cookies (stored on the client side, that could be retrieved later)
    • IP address
    • Browser fingerprinting

    Here's a way to check what data web servers can actually retrieve from your browser: https://panopticlick.eff.org/index.php?action=log&js=yes. Servers can use that data for browser fingerprinting to identify browsers.

    As the network model says, the connection between a client and a server is managed at the session layer. The connection itself may not really be persistent, since the page has already loaded, but through IP addresses and the other identifiers mentioned above the server knows it is the same browser/user agent during the same session. The server learns about its clients from the data exchanged during handshakes, and later it can recognize them by issuing a session ID to each client and through the cookies the client retains.