
Download a URL only if it is an HTML web page


I want to write a Python script that downloads a web page only if it contains HTML. I know the Content-Type header should be used for this, but I can't find a way to get the headers before the file is downloaded. Please suggest a way to do it.


Solution

  • Use http.client to send a HEAD request to the URL. This returns only the headers for the resource, so you can look at the Content-Type header and check whether it is text/html. If it is, send a GET request to the same URL to get the body.
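
A minimal sketch of the HEAD-then-GET approach described above, using only the standard library. The function names (is_html, download_if_html), the 10-second timeout, and the choice to accept only a 200 status are assumptions made for illustration, not part of the original answer. The sketch does not follow redirects.

```python
import http.client
from urllib.parse import urlsplit


def is_html(content_type):
    """Return True if a Content-Type header value denotes HTML."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


def _request(method, url, timeout=10):
    """Send one request; return (response, connection). Hypothetical helper."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn.request(method, path)
    return conn.getresponse(), conn


def download_if_html(url):
    """Return the body as bytes if the URL serves HTML, otherwise None."""
    # Step 1: HEAD request, which returns headers only.
    response, conn = _request("HEAD", url)
    try:
        content_type = response.getheader("Content-Type")
        # Assumption: only a plain 200 response counts; redirects are ignored.
        wanted = response.status == 200 and is_html(content_type)
    finally:
        conn.close()
    if not wanted:
        return None
    # Step 2: GET request to fetch the body.
    response, conn = _request("GET", url)
    try:
        return response.read()
    finally:
        conn.close()
```

Calling download_if_html("http://example.com/") would return the page bytes or None. Some servers do not implement HEAD, or report a different Content-Type for HEAD than for GET, so it may be worth re-checking the header on the GET response as well.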