html, browser, http-headers

How browsers decide which content is HTML


I used to believe that the Content-Type header of an HTML page tells the browser whether the content is HTML or not. I have written a proxy that checks whether Content-Type contains text/html to decide whether a response is HTML.

This worked fine until I found this URL:

http://www.movingcenter.com/mc.dll?page=home

This URL's response headers are:

Connection: close
Date: Tue, 19 Apr 2011 17:32:35 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET

Note that there is no Content-Type header at all. How can I reliably decide whether the page is HTML or not? In this case I know it is.

Thanks, Sparsh Gupta


Solution

  • Any HTTP/1.1 message containing an entity-body SHOULD include a Content-Type header field defining the media type of that body. If and only if the media type is not given by a Content-Type field, the recipient MAY attempt to guess the media type via inspection of its content and/or the name extension(s) of the URI used to identify the resource. If the media type remains unknown, the recipient SHOULD treat it as type "application/octet-stream".

    http://www.w3.org/Protocols/rfc2616/rfc2616-sec7.html#sec7.2.1

    So, you could inspect the start of the message body and see if you can spot a doctype or any HTML tags in it.
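    As a rough illustration of that idea, here is a minimal Python sketch. It trusts Content-Type when the header is present and only falls back to looking at the first bytes of the body when it is missing. The function name `looks_like_html`, the `sniff_bytes` limit and the list of tags are my own assumptions for the example, not anything the RFC or any browser prescribes, and real browsers use more elaborate sniffing rules.

```python
import re

# Markers near the start of a body that suggest HTML.
# This list is illustrative and deliberately short, not exhaustive.
_HTML_HINT = re.compile(
    rb"<(?:!doctype\s+html|html|head|body|title|script|meta|table|div)\b",
    re.IGNORECASE,
)


def looks_like_html(headers, body, sniff_bytes=1024):
    """Guess whether a response is HTML.

    headers: dict of response header names to values (hypothetical shape).
    body: the response body as bytes (only the first sniff_bytes are inspected).
    """
    # Header names are case-insensitive, so search for Content-Type manually.
    content_type = None
    for name, value in headers.items():
        if name.lower() == "content-type":
            content_type = value
            break

    if content_type is not None:
        # Drop parameters such as "; charset=utf-8" before comparing.
        media_type = content_type.split(";", 1)[0].strip().lower()
        return media_type in ("text/html", "application/xhtml+xml")

    # No Content-Type given: guess from the start of the body.
    return _HTML_HINT.search(body[:sniff_bytes]) is not None
```

    With the headers from the question (no Content-Type), a body starting with a doctype or an `<html>` tag would be classified as HTML, while something like a PNG image would not. Being a heuristic, it can misclassify unusual content, so treat the result as a guess.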