Downloading Images with "illegal" characters

I am migrating a shop over for a client.

I have to pull all the old image files off her 'shop' which has no FTP access.

It allowed me to export a list of filenames/URLs. My plan was to load them up in Firefox and use "Downloadthemall" to simply download all the files (around 2000). However, about 1/3 of them have [ and ] in the filename.

i.e.

cdn.crapshop.com/images/image[1].jpg

Downloadthemall freaks out and only reads it as

cdn.crapshop.com/images/image

And won't download it because it isn't a file.

Anyone got any ideas of an alternative way to pull a list like this?

Solution

See this answer, which explains why the example URL you provided is invalid: Validation. As the answer from @good in that post points out, characters that the URL specification does not allow in a path, such as [ and ], have to be percent-encoded so that the web server will understand them.
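As a quick illustration of what that encoding does to the URL from the question, here is a minimal sketch using Python 3's urllib.parse.quote. The http:// scheme is an assumption (the example URL in the question omits it), and encode_url is just a hypothetical helper name.

```python
from urllib.parse import quote


def encode_url(url):
    # Hypothetical helper: percent-encode everything except ':' and '/',
    # so the scheme and the path separators are left alone.
    return quote(url, safe=":/")


# The scheme is assumed; the question's example URL does not include one.
print(encode_url("http://cdn.crapshop.com/images/image[1].jpg"))
# prints http://cdn.crapshop.com/images/image%5B1%5D.jpg
```

The [ becomes %5B and the ] becomes %5D, which is the form a download tool should accept.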

This calls for Python. See this post: Percent encoding in python

Putting it all together, here is a script that reads the URLs from stdin and writes the encoded URLs to stdout. Run it as: python script.py < input > output.out

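import sys
from urllib.parse import quote  # Python 3; in Python 2 this was urllib.quote

# Read one URL per line from stdin and print its percent-encoded form.
for line in sys.stdin:
    url = line.strip()
    if not url:
        continue  # skip blank lines

    # Keep ':' and '/' as they are so the scheme and path separators survive;
    # other special characters, including '[' and ']', are encoded.
    # Note: a URL that is already percent-encoded would be encoded twice.
    print(quote(url, safe=':/'))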

Then, hopefully, Downloadthemall will be able to parse the corrected list that the script produces. (The input to the script is supposed to be a list of URLs, one per line.)

You may be interested in this post as well: Downloading files with python. It shows you how to download files (web pages in particular) using Python, which would let you skip the browser add-on entirely.
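If you go the pure Python route, something along these lines might work. It is only a rough sketch for Python 3: the images output folder and the helper names local_name and download_all are my own assumptions, and it expects every URL in the list to start with http:// or https://.

```python
import os
import sys
from urllib.parse import quote, unquote, urlsplit
from urllib.request import urlretrieve


def local_name(url):
    # Hypothetical helper: use the last path segment, decoded back to its
    # original form, as the local file name.
    return unquote(os.path.basename(urlsplit(url).path))


def download_all(lines, dest_dir="images"):
    # dest_dir is an assumed output folder, not something from the question.
    os.makedirs(dest_dir, exist_ok=True)
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        encoded = quote(url, safe=":/")  # '[' and ']' become %5B and %5D
        target = os.path.join(dest_dir, local_name(encoded))
        try:
            urlretrieve(encoded, target)
        except (OSError, ValueError) as err:
            # Network and HTTP errors are OSError subclasses; a URL without
            # a scheme raises ValueError. Report the failure and carry on.
            print("failed:", url, err, file=sys.stderr)
```

Calling download_all(sys.stdin) would let you feed it the exported list the same way as the script above. Be aware that two images with the same file name in different remote folders would overwrite each other in this sketch.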

Good luck!