Tags: bash, wget, gnu-parallel, bzip2

Efficient parallel downloading and decompressing with matching pattern for list of files on server


Every 6 hours I have to download .bz2 files from a web server, decompress them and merge them into a single file. This needs to be as efficient and quick as possible, since I have to wait for the download and decompression phase to complete before proceeding with the merging.

I have written some bash functions which take some strings as input and use them to construct a matching pattern for the URLs of the files to be downloaded. This way I can pass the matching pattern directly to wget without having to build the server's contents list locally and then feed it to wget as a list with -i. My function looks something like this:

parallelized_extraction(){
    # wait (up to 30 s) for the first .bz2 file to appear
    i=0
    until [ `ls -1 ${1}.bz2 2>/dev/null | wc -l` -gt 0 -o $i -ge 30 ]; do
        ((i++))
        sleep 1
    done
    # keep decompressing in parallel as long as .bz2 files are still present
    while [ `ls -1 ${1}.bz2 2>/dev/null | wc -l` -gt 0 ]; do
        ls ${1}.bz2 | parallel -j+0 bzip2 -d '{}'
        sleep 1
    done
}
download_merge_2d_variable()
{
    filename="file_${year}${month}${day}${run}_*_${1}.grib2"
    wget -b -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/"
    parallelized_extraction ${filename}
    # do the merging 
    rm ${filename}
} 

which I call as download_merge_2d_variable name_of_variable. I was able to speed up the code by writing the function parallelized_extraction, which takes care of decompressing the downloaded files while wget is still running in the background. To do this I first wait for the first .bz2 file to appear, then keep running the parallelized extraction until the last downloaded .bz2 has been decompressed (this is what the until and while loops do).
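
For reference, a typical invocation is sketched below; the date values and the variable name t_2m are only placeholders, since the real values depend on the run being processed.

# hypothetical example: year/month/day/run are read by the function as globals,
# the variable name is passed as the first argument and lowercased for the URL
year=2024; month=01; day=15; run=00
download_merge_2d_variable t_2m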

I'm pretty happy with this approach, however I think it could be improved. Here are my questions:

  • how can I launch multiple instances of wget so that the downloads are parallelized as well, given that my list of files is expressed as a matching pattern? Do I have to write multiple matching patterns, each covering a "chunk" of the data, or do I necessarily have to download a contents list from the server, split that list and then feed it to wget?
  • parallelized_extraction may fail if the download of the files is really slow, as it will not find any new .bz2 file to extract and will exit the loop at the next iteration, even though wget is still running in the background. Although this has never happened to me, it is a possibility. To take care of it I tried to add a condition to the second while loop, using the PID of the wget process running in the background to check whether it is still alive, but somehow it is not working:
parallelized_extraction(){
    # ...................
    # same as before ....
    # ...................
    while [ `ls -1 ${1}.bz2 2>/dev/null | wc -l ` -gt 0 -a kill -0 ${2} >/dev/null 2>&1 ]; do
        ls ${1}.bz2| parallel -j+0 bzip2 -d '{}' 
        sleep 1
    done
}
download_merge_2d_variable()
{
    filename="ifile_${year}${month}${day}${run}_*_${1}.grib2"
    wget -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/" &
    # get ID of process running in background
    PROC_ID=$!
    parallelized_extraction ${filename} ${PROC_ID}
    # do the merging
    rm ${filename}
}

Any clue as to why this is not working? Any suggestions on how to improve my code? Thanks

UPDATE: I'm posting my working solution here, based on the accepted answer, in case someone is interested.

# Extract a plain list of URLs by using --spider option and filtering
# only URLs from the output 
listurls() {
    filename="$1"
    url="$2"
    wget --spider -r -nH -np -nv -nd --reject "index.html" --cut-dirs=3 \
        -A "$filename.bz2" "$url" 2>&1 \
        | grep -Eo '(http|https)://(.*).bz2'
}
# Extract each file by redirecting the stdout of wget to bzip2
# note that I get the filename from the URL directly with
# basename and by removing the bz2 extension at the end 
get_and_extract_one() {
  url="$1"
  file=$(basename "$url" | sed 's/\.bz2$//')
  wget -q -O - "$url" | bzip2 -dc > "$file"
}
export -f get_and_extract_one
# Here the main calling function 
download_merge_2d_variable()
{
    filename="filename.grib2"
    url="url/where/the/file/is/"
    listurls "$filename" "$url" | parallel get_and_extract_one {}
    # merging and processing
}
export -f download_merge_2d_variable
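
Note that the export -f lines are needed because GNU parallel starts each job in a new shell, and a child shell only sees bash functions that have been exported. With everything defined, the whole step is driven by a single call (a sketch, since the file name and URL above are placeholders):

# download, decompress and merge one variable
download_merge_2d_variable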

Solution

  • Can you list the URLs to download?

    listurls() {
      # do something that lists the urls without downloading them
      # Possibly something like:
      # lynx -listonly -image_links -dump "$starturl"
      # or
      # wget --spider -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/"
      # or
      # seq 100 | parallel echo ${url}${year}${month}${day}${run}_{}_${id}.grib2
    }
    
    get_and_extract_one() {
      url="$1"
      file="$2"
      wget -O - "$url" | bzip2 -dc > "$file"
    }
    export -f get_and_extract_one
    
    # {=s:/:_:g; =} will generate a file name from the URL with / replaced by _
    # You probably want something nicer.
    # Possibly just {/.}
    listurls | parallel get_and_extract_one {} '{=s:/:_:g; =}'
    

    This way you decompress while downloading, doing it all in parallel.
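
    If hammering the server with too many simultaneous connections is a concern, the number of parallel jobs can be capped with -j. A small sketch, reusing the functions above and the {/.} replacement string (basename of the URL without its last extension) to build the output file name:

    listurls | parallel -j 8 get_and_extract_one {} '{/.}'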