Tags: web-scraping, curl, wget

How to download images and save them with filenames based on the URL?


How do I download all images from a web page and prefix the image names with the web page's URL (all symbols replaced with underscores)?

For example, if I were to download all images from http://www.amazon.com/gp/product/B0029KH944/, then the main product image would be saved using this filename:

www_amazon_com_gp_product_B0029KH944_41RaFZ6S-0L._SL500_AA300_.jpg

I have installed WinHTTrack and wget and spent more time than it's probably worth trying to get them to do what I want, without success, so Stack Overflow is my last-ditch effort. (WinHTTrack came close: if you set the build option to save files according to the site structure, you can write a script to rename files based on their parent directories. The problem is that the main image is hosted on a different domain.)
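
To make the renaming rule concrete, here is a rough Python sketch of the transformation I'm after (the image URL and the helper name are mine, purely for illustration):

    import re

    def filename_for(page_url, image_url):
        # Strip the scheme, then turn every symbol in the page URL into "_".
        prefix = re.sub(r"[^A-Za-z0-9]", "_",
                        re.sub(r"^https?://", "", page_url)).strip("_")
        # Keep the image's own basename untouched.
        basename = image_url.rsplit("/", 1)[-1]
        return prefix + "_" + basename

    print(filename_for(
        "http://www.amazon.com/gp/product/B0029KH944/",
        "http://ecx.images-amazon.com/images/I/41RaFZ6S-0L._SL500_AA300_.jpg",
    ))
    # -> www_amazon_com_gp_product_B0029KH944_41RaFZ6S-0L._SL500_AA300_.jpg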


Solution

  • Well, I added a download option to my new web scraper, Xidel.

    With that you can do it like this:

     xidel "http://www.amazon.com/dp/B0029KH944/" -e 'site:=translate(filter($url, "http://(.*)", 1), "/.", "__")' -f //img -e 'image:=filter($url, ".*/(.*)", 1)' --download '$site;$image;'
    

    The first -e reads the page URL and replaces the / and . characters with underscores, -f selects all img elements, the second -e extracts each image's filename, and --download saves each image under those two pieces joined together.

    Although it has the disadvantage that it tries to parse every image as an HTML file, which could slow it down a little.
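
    If you'd rather not install xidel, a stdlib-only Python sketch of the same idea might look like this. It is only a sketch: no error handling, no User-Agent header, and the product URL is just the example from the question.

        import re
        import urllib.request
        from html.parser import HTMLParser
        from urllib.parse import urljoin

        class ImgCollector(HTMLParser):
            # Gather the src attribute of every <img> tag on the page.
            def __init__(self):
                super().__init__()
                self.srcs = []

            def handle_starttag(self, tag, attrs):
                if tag == "img":
                    src = dict(attrs).get("src")
                    if src:
                        self.srcs.append(src)

        def download_images(page_url):
            # Same naming rule as above: symbols in the page URL -> underscores.
            prefix = re.sub(r"[^A-Za-z0-9]", "_",
                            re.sub(r"^https?://", "", page_url)).strip("_")
            html = urllib.request.urlopen(page_url).read().decode("utf-8", "replace")
            collector = ImgCollector()
            collector.feed(html)
            for src in collector.srcs:
                img_url = urljoin(page_url, src)  # resolves cross-domain images too
                name = prefix + "_" + img_url.rsplit("/", 1)[-1]
                with open(name, "wb") as f:
                    f.write(urllib.request.urlopen(img_url).read())

        download_images("http://www.amazon.com/gp/product/B0029KH944/")

    Unlike the xidel one-liner, this only fetches each image once and never tries to parse it as HTML, but it also won't follow frames or CSS backgrounds, and sites that require cookies or a browser-like User-Agent will reject the plain urllib request.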