
Merge these wget & egrep commands for recursive download of sitemap


I am trying to find a way to make these two commands work together. I can run this successfully using Wget for Windows:

    wget --html-extension -r http://www.sitename.com

This downloads every file on my server that is directly linked from the root domain. I'd rather download only the pages in my sitemap. For that, I found the following trick, which uses Cygwin:

    wget --quiet https://www.sitename.com/sitemap.xml --output-document - | egrep -o \
    "http://www\.sitename\.com[^<]+" | wget --spider -i - --wait 1

However, this only checks that the pages exist; it doesn't download them as static HTML files the way the first wget command does.

Is there a way to merge these and download the sitemap pages as local HTML files?


Solution

  • If you look at the man page for wget, you will see that the --spider entry is as follows:

    --spider
           When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
    

    All you need to do to actually download the files is remove --spider from your command. (The URL pattern below is also broadened to https? so it matches whichever scheme the sitemap's entries use.)

    wget --quiet https://www.sitename.com/sitemap.xml --output-document - | egrep -o \
    "https?://www\.sitename\.com[^<]+" | wget -i - --wait 1