Tags: bash, magento, caching, web-crawler

Bash script cache warmer ignoring URLs in Magento XML sitemap?


I am trying to run a site crawler that uses my sitemap.xml. I have Varnish running on Magento and I would like to warm up the cache after cleaning it.

I'm using Turpentine's warm-cache script, but for some reason it gets 0 URLs.

My sitemap XML is at http://easyfarm.ro/sitemap.xml.

I've researched around this a bit and couldn't come up with a bash script that gets the URLs out of my XML; unfortunately I'm no Linux guru. Can you help me with some tips / documentation links? Any help would be much appreciated, thank you.

Later edit:

When I run warm-cache.sh I get:

 Getting URLs from sitemap... 
 Warming 0 URLs using 4 processes...

I also found a nice crawl script:

wget -O - easyfarm.ro/sitemap.xml | grep -E -o '<loc>.*</loc>' | sed -e 's/<loc>//g' -e 's/<\/loc>//g' | wget -i - -p -r --level=2 --delete-after

However, it does not access any URLs either; I get:

--2013-11-19 16:53:16--  http://easyfarm.ro/sitemap.xml
Resolving easyfarm.ro (easyfarm.ro)... 188.240.47.148
Connecting to easyfarm.ro (easyfarm.ro)|188.240.47.148|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: `STDOUT'

    [ <=>                                                                                             ] 7,703       --.-K/s   in 0s

2013-11-19 16:53:17 (883 MB/s) - written to stdout [7703]
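
For reference, here is the same pipeline split into one stage per line with comments; it behaves identically to the one-liner above, so the comments only describe what each step is meant to do:

wget -O - easyfarm.ro/sitemap.xml |           # download the sitemap to stdout
  grep -E -o '<loc>.*</loc>' |                # keep only the <loc>...</loc> fragments
  sed -e 's/<loc>//g' -e 's/<\/loc>//g' |     # strip the tags, leaving bare URLs
  wget -i - -p -r --level=2 --delete-after    # crawl each URL (plus page requisites) and discard the files

Note that the greedy .* in the grep stage can merge every URL into a single match when the sitemap is served as one long line, which would leave the final wget with nothing usable to fetch.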

Solution

  • Make sure you have xpath installed and available to the script.

    More generally, make sure every command called in the script is available: xpath, curl, sed, grep, cat, xargs, siege, rm. Some of these are available by default on most systems; others are not. See the sketch below this answer for a quick way to check.

    The installation procedure varies with each distribution. For example, in Ubuntu Linux you would use apt-get install libxml-xpath-perl to get xpath.
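
To illustrate, here is a minimal sketch of a sitemap warmer built from those same commands. This is not Turpentine's actual warm-cache.sh; the sitemap URL and the four processes are taken from the question, and everything else is an assumption that may need adjusting for your setup:

#!/bin/bash
# Minimal sitemap cache-warmer sketch (an illustration, not Turpentine's script).
SITEMAP_URL="http://easyfarm.ro/sitemap.xml"   # sitemap from the question
PROCESSES=4                                    # matches the "4 processes" in the output above

# Fail early if any required command is missing.
for cmd in curl xpath xargs grep rm; do
    command -v "$cmd" >/dev/null 2>&1 || { echo "Missing required command: $cmd" >&2; exit 1; }
done

# Download the sitemap to a temporary file.
TMP_SITEMAP=$(mktemp)
curl -s "$SITEMAP_URL" -o "$TMP_SITEMAP"

# Pull every <loc> value out of the sitemap with xpath (from libxml-xpath-perl).
# If your xpath build is strict about the sitemap's default namespace, the expression may need adjusting.
URLS=$(xpath -q -e '//loc/text()' "$TMP_SITEMAP" 2>/dev/null)
rm -f "$TMP_SITEMAP"

echo "Warming $(echo "$URLS" | grep -c .) URLs using $PROCESSES processes..."

# Request each URL in parallel and throw the responses away; the point is only to populate Varnish.
echo "$URLS" | xargs -r -P "$PROCESSES" -n 1 curl -s -o /dev/null

If the dependency loop reports a missing command, installing it (for example the libxml-xpath-perl package mentioned above) is exactly the kind of fix the zero-URL symptom points to.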