Every 6 hours I have to download bz2 files from a web server, decompress them and merge them into a single file. This needs to be as efficient and quick as possible, since I have to wait for the download and decompression phase to finish before proceeding with the merging.
I have written some bash functions which take some strings as input and use them to construct the URL of the files to download as a matching pattern. This way I can pass the matching pattern directly to wget, without having to build the server's contents list locally and then feed it to wget with -i. My function looks something like this:
parallelized_extraction(){
    # wait (up to 30 seconds) for the first .bz2 file to appear
    i=0
    until [ `ls -1 ${1}.bz2 2>/dev/null | wc -l` -gt 0 -o $i -ge 30 ]; do
        ((i++))
        sleep 1
    done
    # keep decompressing in parallel as long as .bz2 files are present
    while [ `ls -1 ${1}.bz2 2>/dev/null | wc -l` -gt 0 ]; do
        ls ${1}.bz2 | parallel -j+0 bzip2 -d '{}'
        sleep 1
    done
}
download_merge_2d_variable()
{
    filename="file_${year}${month}${day}${run}_*_${1}.grib2"
    wget -b -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/"
    parallelized_extraction ${filename}
    # do the merging
    rm ${filename}
}
which I call as download_merge_2d_variable name_of_variable.
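For context, the date variables used inside the function (year, month, day, run) are defined elsewhere in my script. A minimal sketch of a driver, where all values and variable names are purely illustrative, would be:
# Hypothetical driver: dates and variable names are made up for illustration
year=2024; month=01; day=15; run=00
download_merge_2d_variable T_2M     # ${1,,} lowercases this to t_2m in the URL
download_merge_2d_variable PMSL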
I was able to speed up the code by writing the function parallelized_extraction, which takes care of decompressing the downloaded files while wget is running in the background. To do this I first wait for the first .bz2 file to appear, then run the parallelized extraction until no more .bz2 files are left (this is what the until and while loops are doing).
I'm pretty happy with this approach; however, I think it could be improved. Here are my questions:

1. Is it possible to have wget perform parallel downloads as well when my list of files is given as a matching pattern? Do I have to write multiple matching patterns with "chunks" of data inside, or do I necessarily have to download a contents list from the server, split this list and then give it as input to wget?

2. parallelized_extraction may fail if the download of the files is really slow, as it will not find any new bz2 file to extract and will exit from the loop at the next iteration, even though wget is still running in the background. Although this has never happened to me, it is a possibility. To take care of that I tried to add a condition to the second while loop, getting the PID of the wget process running in the background and checking whether it is still alive, but somehow it is not working:

parallelized_extraction(){
    # ...................
    # same as before ....
    # ...................
    while [ `ls -1 ${1}.bz2 2>/dev/null | wc -l` -gt 0 -a kill -0 ${2} >/dev/null 2>&1 ]; do
        ls ${1}.bz2 | parallel -j+0 bzip2 -d '{}'
        sleep 1
    done
}
download_merge_2d_variable()
{
    filename="ifile_${year}${month}${day}${run}_*_${1}.grib2"
    wget -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/" &
    # get ID of process running in background
    PROC_ID=$!
    parallelized_extraction ${filename} ${PROC_ID}
    # do the merging
    rm ${filename}
}
Any clue as to why this is not working? Any suggestions on how to improve my code? Thanks.
UPDATE: I'm posting here my working solution based on the accepted answer, in case someone is interested.
# Extract a plain list of URLs by using the --spider option and filtering
# only the URLs from the output
listurls() {
    filename="$1"
    url="$2"
    wget --spider -r -nH -np -nv -nd --reject "index.html" --cut-dirs=3 \
        -A "$filename.bz2" "$url" 2>&1 \
        | grep -Eo '(http|https)://(.*)\.bz2'
}
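The list can be sanity-checked by running listurls on its own; the pattern and URL below are placeholders, not the real server:
# Hypothetical invocation; prints one URL per line, e.g.
#   https://example.com/data/file_2024011500_001_T_2M.grib2.bz2
listurls "file_2024011500_*_T_2M.grib2" "https://example.com/data/"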
# Extract each file by redirecting the stdout of wget to bzip2.
# Note that I get the filename directly from the URL with
# basename, stripping the .bz2 extension at the end
get_and_extract_one() {
    url="$1"
    file=$(basename "$url" .bz2)
    wget -q -O - "$url" | bzip2 -dc > "$file"
}
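A single file can also be tested in isolation before running the whole pipeline; again, the URL is made up:
# Hypothetical single-file test: downloads and decompresses in one stream,
# leaving file_2024011500_001_T_2M.grib2 in the current directory
get_and_extract_one "https://example.com/data/file_2024011500_001_T_2M.grib2.bz2"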
export -f get_and_extract_one
# Here is the main calling function
download_merge_2d_variable()
{
    filename="filename.grib2"
    url="url/where/the/file/is/"
    listurls "$filename" "$url" | parallel get_and_extract_one {}
    # merging and processing
}
export -f download_merge_2d_variable
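If the server or the local disk turns out to be the bottleneck, the number of simultaneous download-and-decompress jobs can be capped with parallel's -j option (the value below is only an example):
# Hypothetical: run at most 4 jobs at a time instead of one per CPU core
listurls "$filename" "$url" | parallel -j 4 get_and_extract_one {}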
Can you list the URLs to download?
listurls() {
    # do something that lists the URLs without downloading them
    # Possibly something like:
    #   lynx -listonly -image_links -dump "$starturl"
    # or
    #   wget --spider -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/"
    # or
    #   seq 100 | parallel echo ${url}${year}${month}${day}${run}_{}_${id}.grib2
}
get_and_extract_one() {
    url="$1"
    file="$2"
    wget -O - "$url" | bzip2 -dc > "$file"
}
export -f get_and_extract_one
# {=s:/:_:g; =} will generate a file name from the URL with / replaced by _
# You probably want something nicer.
# Possibly just {/.}
listurls | parallel get_and_extract_one {} '{=s:/:_:g; =}'
This way you will decompress while downloading, doing it all in parallel.
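To see what those replacement strings expand to without downloading anything, parallel's --dry-run option can print the commands instead of running them (the URL here is made up):
parallel --dry-run get_and_extract_one {} '{=s:/:_:g; =}' ::: https://example.com/a/b/file.grib2.bz2
# -> get_and_extract_one https://example.com/a/b/file.grib2.bz2 https:__example.com_a_b_file.grib2.bz2
parallel --dry-run get_and_extract_one {} {/.} ::: https://example.com/a/b/file.grib2.bz2
# -> get_and_extract_one https://example.com/a/b/file.grib2.bz2 file.grib2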