html, web-scraping, wget, httrack

How do I get httrack to save files with their original names rather than index****.html?


I'm following the HTTrack docs example here: http://httrack.kauler.com/help/User-defined_structure

The site I need to scrape has URLs in this structure:

https://www.example.com/index.php?HelpTopics

https://www.example.com/index.php?MoreHelp

etc.

With HTTrack, I want to download the site and save the files in the format

HelpTopics.html, MoreHelp.html, etc.

I'm using this on the command line, modified from the docs linked above:

httrack "https://www.example.com" %n%[index.php?:-:::].%t

but all the files still get saved as index2b26.html, index2de7.html, etc.

What am I doing wrong with the HTTrack options? Is it failing because the URLs on example.com have no file extensions?
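
(For reference, the structure strings on that help page are normally passed through HTTrack's -N option, with -O setting the output directory, so the full invocation I'm aiming for looks roughly like the sketch below; examplefolder is just a placeholder name, and I'm not sure the %[...] capture is even right for bare query strings like ?HelpTopics.)

httrack "https://www.example.com" -O examplefolder -N "%n%[index.php?:-:::].%t"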


Solution

  • I found it's much easier to use wget to save files with their original names. This does it:

    wget --mirror -p --convert-links --content-disposition --trust-server-names -P examplefolder http://www.example.com
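
    For reference, here is the same command with each flag annotated, based on my reading of the wget man page (examplefolder is just the name of the local output directory):

    # --mirror                turn on recursion and timestamping with infinite depth
    # -p                      also fetch page requisites (images, CSS) so pages render offline
    # --convert-links         rewrite links in the saved pages to point at the local copies
    # --content-disposition   use Content-Disposition headers when choosing local file names
    # --trust-server-names    name files after the last URL in a redirect chain
    # -P examplefolder        write everything under the examplefolder directory
    wget --mirror -p --convert-links --content-disposition --trust-server-names \
        -P examplefolder http://www.example.com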