javascript python html css screen-scraping

Save an HTML page and change all links to point to the right place


You probably know that IE has this thing where you can save a web page, and it will automatically download the HTML file and all the image/CSS/JS files that the HTML file uses.

Now there is one problem with this: the links in the HTML file are not changed. So if I download a page from example.com that contains <a href="/hi.html">, the copy that I saved with IE will have a link to C:\Documents and Settings...(path to the folder that the html file is in) instead of to example.com.

Is there a Python library that will download an HTML page for me, along with everything it uses (images/JS/CSS)? If yes, is there a library that will also change the links for me?

Thanks!!


Solution

  • Since you're mentioning IE specifically, I'm not sure if this is going to be of any use to you, but on Linux the easiest way to completely mirror a website is with the wget command.

    wget --mirror --convert-links -w 1 http://www.example.com
    

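    Here --mirror turns on recursive downloading, --convert-links rewrites the links in the saved files so they work when viewed locally, and -w 1 waits one second between requests. Run man wget if you need more options.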
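
If you'd rather stay in Python, the link-fixing half of the question can be roughly sketched with just the standard library. This is only a sketch under some assumptions: the names LinkRewriter and absolutize_links are made up for illustration, only href and src attributes are touched, and the page is re-serialized, so quoting and whitespace inside tags may come out slightly different from the original. Note that it makes links absolute (pointing back at the original site) rather than pointing them at local copies the way wget --convert-links does, and it does not download the images/JS/CSS.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes assumed to hold URLs; extend this set if you need more.
URL_ATTRS = {"href", "src"}


class LinkRewriter(HTMLParser):
    """Hypothetical helper: re-emits the HTML with href/src made absolute."""

    def __init__(self, base_url):
        # convert_charrefs=False keeps entities like &amp; in text as-is.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _render(self, tag, attrs, closing=""):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "disabled"
                parts.append(name)
                continue
            if name in URL_ATTRS:
                # Resolve the link against the page's original URL.
                value = urljoin(self.base_url, value)
            parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + closing + ">")

    def handle_starttag(self, tag, attrs):
        self._render(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._render(tag, attrs, " /")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def absolutize_links(html_text, base_url):
    """Return html_text with every href/src resolved against base_url."""
    parser = LinkRewriter(base_url)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = '<a href=/hi.html>hi</a>'
    print(absolutize_links(page, "http://www.example.com/"))
```

You would pass in the HTML text you already saved plus the URL it originally came from, then write the returned string back to disk.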