I am trying to run a site crawler that uses my sitemap.xml. I have Varnish running on Magento and I would like to warm up the cache after clearing it.
I'm using Turpentine's warm-cache script, but for some reason it finds 0 URLs.
My XML is here.
I've researched this a bit, but I couldn't come up with a bash script that extracts the URLs from my XML; unfortunately I'm no Linux guru. Can you help me with some tips or documentation links? Any help would be much appreciated, thank you.
Later edit:
When I run warm-cache.sh I get:
Getting URLs from sitemap...
Warming 0 URLs using 4 processes...
I also found a nice crawl script:
wget -O - easyfarm.ro/sitemap.xml | grep -E -o '<loc>.*</loc>' | sed -e 's/<loc>//g' -e 's/<\/loc>//g' | wget -i - -p -r --level=2 --delete-after
However, it does not access any URLs either; I get:
--2013-11-19 16:53:16-- http://easyfarm.ro/sitemap.xml
Resolving easyfarm.ro (easyfarm.ro)... 188.240.47.148
Connecting to easyfarm.ro (easyfarm.ro)|188.240.47.148|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: `STDOUT'
[ <=> ] 7,703 --.-K/s in 0s
2013-11-19 16:53:17 (883 MB/s) - written to stdout [7703]
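One thing I noticed while digging: if the sitemap is written on a single line (common with generated sitemaps), the greedy pattern <loc>.*</loc> matches from the first <loc> to the last </loc>, so all the URLs collapse into one unusable blob. A non-greedy variant would avoid that. The lines below are only a sketch, reusing the easyfarm.ro address from above and assuming curl and GNU xargs are installed; they are not the Turpentine script itself.

# Sketch: pull each <loc> value out separately, then warm every URL with 4 parallel curl requests.
wget -q -O - http://easyfarm.ro/sitemap.xml \
  | grep -o '<loc>[^<]*</loc>' \
  | sed -e 's/<loc>//' -e 's/<\/loc>//' \
  | xargs -n 1 -P 4 curl -s -o /dev/null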
Make sure you have xpath installed and available to the script. More generally, make sure every command called in the script is available: xpath, curl, sed, grep, cat, xargs, siege, rm. Some of these are available by default on most systems; some are not. The installation procedure varies with each distribution. For example, on Ubuntu Linux you would use apt-get install libxml-xpath-perl to get xpath.
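If you want a quick way to see what is missing, a small check along these lines works (just a sketch; adjust the command list to whatever your copy of warm-cache.sh actually calls):

# Sketch: report any command the script depends on that is not on the PATH.
for cmd in xpath curl sed grep cat xargs siege rm; do
  command -v "$cmd" >/dev/null 2>&1 || echo "missing: $cmd"
done

Once xpath is installed, you can also point it at your sitemap by hand to confirm it parses, for example with something like xpath -q -e '//loc/text()' sitemap.xml, though the exact flags differ between versions of the tool.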