Tags: bash, wget, background-process

Parallel wget download script does not exit properly


I am trying to download files from a file (test.txt) containing over 15,000 links.

I have this script:

#!/bin/bash

function download {

FILE=$1

while read line; do
        url=$line

        wget -nc -P ./images/ $url

        #downloading images which are not in the test.txt, 
        #by guessing name: 12345_001.jpg, 12345_002.jpg..12345_005.jpg etc.

        wget -nc  -P ./images/ ${url%.jpg}_{001..005}.jpg
done < $FILE

}  

#test.txt contains the URLs
split -l 1000 ./temp/test.txt ./temp/split

#read the split files and pass each one to the download function
for f in ./temp/split*; do
    download $f &
done

test.txt:

http://xy.com/12345.jpg
http://xy.com/33442.jpg
...

I am splitting the file into a few pieces and backgrounding the download function (download $f &) so the script can move on to the next split file of links while the previous one is still downloading.

The script is working, but it does not exit at the end; I have to press Enter. If I remove the & from the line download $f &, it exits properly, but I lose the parallel downloading.
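
Adding a wait after the loop would at least make the parent shell block until all of the background download jobs have exited; a minimal sketch of that variant:

for f in ./temp/split*; do
    download $f &
done
wait   # block here until every background download job has finished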

Edit:

As I have found out, this is not the best way to parallelize wget downloads. It would be better to use GNU Parallel.



Solution

  • May I commend GNU Parallel to you?

    parallel --dry-run -j32 -a URLs.txt 'wget -ncq -P ./images/ {}; wget -ncq  -P ./images/ {.}_{001..005}.jpg'
    

    I am only guessing what your input file URLs.txt looks like, but I assume something resembling:

    http://somesite.com/image1.jpg
    http://someothersite.com/someotherimage.jpg
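
    With --dry-run, parallel only prints the commands it would run instead of executing them, so you can inspect them first and drop --dry-run once they look right. For the first sample line above, the printed command should come out roughly like this (my guess at the output, based on the default {} and {.} replacement strings):

    wget -ncq -P ./images/ http://somesite.com/image1.jpg; wget -ncq -P ./images/ http://somesite.com/image1_{001..005}.jpg

    {.} is the input line with its extension stripped, and the {001..005} part is brace-expanded by the shell that runs each job (assuming it is bash or similar), which is how the second wget tries the five guessed names.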
    

    Or, using your own approach with a function:

    #!/bin/bash
    
    # define and export a function for "parallel" to call
    doit(){
       wget -ncq -P ./images/ "$1"
       wget -ncq -P ./images/ "$2_{001..005}.jpg"
    }
    export -f doit
    
    parallel --dry-run  -j32 -a URLs.txt doit {} {.}
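
    One detail to watch: the brace expansion in doit has to stay outside the double quotes, because bash does not perform brace expansion inside quotes and wget would then be asked for the single literal name 12345_{001..005}.jpg. A quick standalone illustration (hypothetical URL, separate from the script above):

    url=http://somesite.com/image1
    echo "${url}_{001..005}.jpg"    # one argument, braces left literal inside the quotes
    echo "${url}"_{001..005}.jpg    # five arguments: image1_001.jpg ... image1_005.jpg

    As before, remove --dry-run from the parallel line once the generated doit calls look right.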