I used to believe that the Content-Type header of an HTML page tells the browser whether the content is HTML. I have written a proxy that checks whether Content-Type contains text/html to decide if a response is HTML.
This worked fine until I found this URL:
http://www.movingcenter.com/mc.dll?page=home
This URL's response headers, which include no Content-Type at all, are:
Connection: close
Date: Tue, 19 Apr 2011 17:32:35 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
How can I reliably decide whether the page is HTML? In this case I know it is.
Thanks, Sparsh Gupta
Any HTTP/1.1 message containing an entity-body SHOULD include a Content-Type header field defining the media type of that body. If and only if the media type is not given by a Content-Type field, the recipient MAY attempt to guess the media type via inspection of its content and/or the name extension(s) of the URI used to identify the resource. If the media type remains unknown, the recipient SHOULD treat it as type "application/octet-stream".
— http://www.w3.org/Protocols/rfc2616/rfc2616-sec7.html#sec7.2.1
So, you could inspect the start of the message body and see if you can spot a doctype or any HTML tags in it.
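A rough sketch of that check in Python might look like the following. It is only a heuristic, not a full content-sniffing algorithm: the function name `looks_like_html`, the `sniff_bytes` limit, and the list of tags in `_HTML_MARKERS` are all assumptions made for illustration. Following the RFC text above, it trusts Content-Type when one is present and only guesses when it is missing.

```python
import re

# Hypothetical, non-exhaustive list of markers that suggest an HTML document.
_HTML_MARKERS = re.compile(
    rb"<(?:!doctype\s+html|html|head|body|title|script|meta|table|div|p|a|h1|br)\b",
    re.IGNORECASE,
)


def looks_like_html(content_type, body, sniff_bytes=1024):
    """Return True if the response is probably HTML.

    content_type: value of the Content-Type header, or None/"" if absent.
    body: the (possibly partial) response body as bytes.
    """
    if content_type:
        # A media type was given, so use it and do not guess.
        media_type = content_type.split(";", 1)[0].strip().lower()
        return media_type in ("text/html", "application/xhtml+xml")

    # No Content-Type: inspect only the start of the body.
    prefix = body[:sniff_bytes]
    if b"\x00" in prefix:
        # NUL bytes usually mean binary data (assumption: this also
        # rejects UTF-16 encoded HTML, which this sketch does not handle).
        return False
    return _HTML_MARKERS.search(prefix) is not None
```

In a proxy you would call this with the header value (or None) and the first chunk of the body, for example `looks_like_html(None, first_chunk)`. If nothing matches, the RFC suggests treating the body as application/octet-stream.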