Downloading Images with "illegal" characters


I am migrating a shop over for a client.

I have to pull all the old image files off her 'shop' which has no FTP access.

It allowed me to export a list of filenames/URLs. My plan was to load them up in Firefox and use "Downloadthemall" to simply download all the files (around 2000). However, about 1/3 of them have [ and ] in the name.

e.g.

cdn.crapshop.com/images/image[1].jpg

Downloadthemall freaks out and only reads it as

cdn.crapshop.com/images/image

And won't download it because it isn't a file.

Anyone got any ideas of an alternative way to pull a list like this?


Solution

  • The example URL you provided is invalid; this post explains why: Validation. As the answer from @good there shows, any characters that the URL specification does not allow, such as [ and ], have to be percent-encoded so that the web server will understand them.

    This calls for Python; see this post: Percent encoding in python

    Putting it all together, the script below reads URLs from stdin and writes the encoded versions to stdout: python script.py < input > output.out.

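    import sys
    from urllib.parse import quote  # Python 3 home of urllib.quote
    
    # Read one URL per line from stdin and print its percent-encoded form.
    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue  # skip blank lines
    
        # Keep ':' and '/' unencoded so the scheme and path separators
        # survive; '[' and ']' become %5B and %5D.
        print(quote(url, safe=':/'))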
    

    The input to the script should be a list of URLs, one per line. Then, hopefully, Downloadthemall will parse the corrected list that the script produces.

    You may be interested in this post as well: Downloading files with python, which shows how to download files (web pages in particular) using Python.

    Good luck!
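
If the browser add-on still chokes on the list, the encoding and downloading steps mentioned above could be combined into a single Python 3 script. This is only a minimal sketch under some assumptions: the names `encode_url`, `local_name` and `download_all` and the `images` output directory are made up for illustration, the exported list is assumed to hold one URL per line, and entries without a scheme are assumed to be reachable over plain http.

```python
import posixpath
import sys
import urllib.request
from pathlib import Path
from urllib.parse import quote, unquote, urlsplit


def encode_url(url):
    """Percent-encode characters such as [ and ], leaving URL delimiters alone."""
    # '%' is kept safe so that already-encoded URLs are not encoded twice.
    return quote(url.strip(), safe=":/?&=%")


def local_name(url):
    """Derive a local filename (e.g. image[1].jpg) from an encoded URL."""
    return unquote(posixpath.basename(urlsplit(url).path))


def download_all(lines, dest="images"):
    """Fetch every URL in `lines` (one per line) into the directory `dest`."""
    Path(dest).mkdir(exist_ok=True)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        url = encode_url(line)
        # Assumption: the exported list may omit the scheme, so default to http.
        if "://" not in url:
            url = "http://" + url
        target = Path(dest) / local_name(url)
        try:
            urllib.request.urlretrieve(url, target)
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            print(f"failed: {url} ({err})", file=sys.stderr)
```

To use it on the exported list, add a final line `download_all(sys.stdin)` and run `python download.py < input`, where `download.py` is whatever you name the file. Failed URLs are reported on stderr instead of stopping the run, so one bad entry out of 2000 does not abort the rest.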