Tags: url, download, wget, custom-error-pages

Downloading multiple files in a directory using wget


I would like to download multiple images in a directory using wget. Here's an image,

https://pkmnbinder.com/images/PAL/001.png

I can download this using

Wget "https://pkmnbinder.com/images/PAL/001.png"

. However when I try something like,

Wget -r "https://pkmnbinder.com/images/PAL/"

I get this error message:

ERROR 404: Not Found.

If I visit "https://pkmnbinder.com/images/PAL/" in a web browser, I'm also met with a 404 page. How can I download all image files over at "https://pkmnbinder.com/images/PAL/"? I understand there's an -i option with wget, but I'll first need to know what image files exist in the directory if I want to make a file containing all the filenames.

I tried

wget -r "https://pkmnbinder.com/images/PAL/"

and expected all images in the directory to be downloaded, but I got a 404 error instead. I've also tried other options such as -np -nv -4 -R index.html to no avail.

Thanks in advance!


Solution

  • There is no direct way to retrieve a list of all files in a directory hosted by a server, unless the server exposes a directory listing page, which, as you have noticed, this server does not — https://pkmnbinder.com/images/PAL/ responds with a 404 NOT FOUND.
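
    As a quick check (a sketch, not something you strictly need), wget's --spider option probes a URL without downloading anything, which makes it easy to confirm that the directory URL 404s while an individual image still resolves:

    # the directory has no listing page, so this reports 404 Not Found
    wget --spider "https://pkmnbinder.com/images/PAL/"
    # an individual image still resolves fine (200 OK)
    wget --spider "https://pkmnbinder.com/images/PAL/001.png"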

    For the particular example you provided, it is not too difficult to infer the names of the images within the directory: they are incrementing 3-digit number strings (001.png, 002.png, etc.). So it would suffice to write a loop to go through all strings starting from 001 up until there are no more images, i.e. the server responds with a 404 NOT FOUND. The loop can be implemented in any language you like, but I will stick with bash here, as you will likely already have that installed:

    #!/usr/bin/env bash
    
    image=1
    status=0
    
    while [ $status -eq 0 ] # loop while status is 0
    do
        printf -v image_id "%03d" $image # left-pad image number with zeros
        wget "https://pkmnbinder.com/images/PAL/$image_id.png"
        status=$? # save exit code of wget to status variable - non-zero exit code means wget failed
        ((image++)) # increment image number
    done
    

    To run the script, you can save it to a file, chmod +x the file, and execute it. For example, if I have the script saved to a file called download, I would execute the following commands in the same directory as the download script:

    chmod +x download
    ./download
    

    This should download the images you are looking for.
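
    Alternatively, if you do not mind over-estimating the number of images (say, 300 as a guess), bash's zero-padded brace expansion can generate all the candidate names in one go and hand them to a single wget invocation; wget simply reports a 404 and moves on for numbers that do not exist. A minimal sketch, assuming bash 4 or newer:

    # expands to .../001.png .../002.png ... .../300.png (300 is a guessed upper bound)
    wget "https://pkmnbinder.com/images/PAL/"{001..300}".png"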

    If you are looking for a more general solution to this problem, I would suggest you have a look at web scraping, which again can be implemented in a language of your choosing. For the particular site you provided, I can see that it contains many collections of images (the Pokemon cards). You can programmatically iterate through the collections, and for each one query all the images available. For example, after selecting one of the collections, I can execute the following JavaScript to get all the URLs:

    Array.from(document.querySelectorAll("img")).map(img => img.src)
    

    After you have the URLs scraped, you can download the images in any way you like (e.g. using wget).
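
    For instance, sticking with wget and assuming you have saved the scraped URLs to a plain text file (here hypothetically called urls.txt, one URL per line), the -i option you mentioned reads that list and downloads every entry:

    # urls.txt is a hypothetical file containing one scraped image URL per line
    wget -i urls.txt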

    I understand your question was how to achieve this behaviour using just wget; however, I don't think that is possible with the example site you provided, and a bit more effort is required, e.g. using one of the methods described above.