Tags: bash, wget, gnu-parallel

Check if a remote file exists in bash


I am downloading files with this script:

parallel --progress -j16 -a ./temp/img-url.txt 'wget -nc -q -P ./images/ {}; wget -nc -q -P ./images/ {.}_{001..005}.jpg'
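
Here {} is the URL read from img-url.txt and {.} is that URL without its extension; GNU Parallel runs each command through a shell, so {001..005} is brace-expanded. For example, for the input line http://host.com/092401.jpg the two wget calls become:

wget -nc -q -P ./images/ http://host.com/092401.jpg
wget -nc -q -P ./images/ http://host.com/092401_001.jpg http://host.com/092401_002.jpg ... http://host.com/092401_005.jpg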

Would it be possible not to download the files, but just to check that they exist on the remote side and, if a file exists, create a dummy file instead of downloading it?

Something like:

# --spider only checks that the URL exists; nothing is downloaded
if wget --spider "$url" 2>/dev/null; then
  touch img.file
fi

should work, but I don't know how to combine this code with GNU Parallel.

Edit:

Based on Ole's answer I wrote this piece of code:

#!/bin/bash
do_url() {
  url="$1"
  wget -q -nc --method HEAD "$url" && touch ./images/"${url##*/}"
  # get the filename from $url
  url2=${url##*/}
  wget -q -nc --method HEAD "${url%.jpg}"_{001..005}.jpg && touch ./images/"${url2%.jpg}"_{001..005}.jpg
}
export -f do_url

parallel --progress -a urls.txt do_url {}

It works, but it fails for some files. I cannot see any consistency in why it works for some files and fails for others; maybe it has something to do with the last filename. The second wget accesses the correct URL, but the touch command after it simply does not create the desired files. The first wget always (correctly) handles the main image without the _001.jpg .. _005.jpg suffix.

Example urls.txt:

http://host.com/092401.jpg (works correctly: _001.jpg .. _005.jpg are created)
http://host.com/HT11019.jpg (does not work: only the main image is handled)


Solution

  • It is pretty hard to understand what it is you really want to accomplish. Let me try to rephrase your question.

    I have urls.txt containing:

    http://example.com/dira/foo.jpg
    http://example.com/dira/bar.jpg
    http://example.com/dirb/foo.jpg
    http://example.com/dirb/baz.jpg
    http://example.org/dira/foo.jpg
    

    On example.com these URLs exist:

    http://example.com/dira/foo.jpg
    http://example.com/dira/foo_001.jpg
    http://example.com/dira/foo_003.jpg
    http://example.com/dira/foo_005.jpg
    http://example.com/dira/bar_000.jpg
    http://example.com/dira/bar_002.jpg
    http://example.com/dira/bar_004.jpg
    http://example.com/dira/fubar.jpg
    http://example.com/dirb/foo.jpg
    http://example.com/dirb/baz.jpg
    http://example.com/dirb/baz_001.jpg
    http://example.com/dirb/baz_005.jpg
    

    On example.org these URLs exist:

    http://example.org/dira/foo_001.jpg
    

    Given urls.txt I want to generate the combinations with _001.jpg .. _005.jpg in addition to the original URL. E.g.:

    http://example.com/dira/foo.jpg
    

    becomes:

    http://example.com/dira/foo.jpg
    http://example.com/dira/foo_001.jpg
    http://example.com/dira/foo_002.jpg
    http://example.com/dira/foo_003.jpg
    http://example.com/dira/foo_004.jpg
    http://example.com/dira/foo_005.jpg
    

    Then I want to test whether these URLs exist, without downloading the files. As there are many URLs, I want to do this in parallel.

    If the URL exists I want an empty file created.

    (Version 1): I want the empty file created in a similar directory structure in the dir images. This is needed because some of the images have the same name, but are in different dirs.

    So the files created should be:

    images/http:/example.com/dira/foo.jpg
    images/http:/example.com/dira/foo_001.jpg
    images/http:/example.com/dira/foo_003.jpg
    images/http:/example.com/dira/foo_005.jpg
    images/http:/example.com/dira/bar_000.jpg
    images/http:/example.com/dira/bar_002.jpg
    images/http:/example.com/dira/bar_004.jpg
    images/http:/example.com/dirb/foo.jpg
    images/http:/example.com/dirb/baz.jpg
    images/http:/example.com/dirb/baz_001.jpg
    images/http:/example.com/dirb/baz_005.jpg
    images/http:/example.org/dira/foo_001.jpg
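
    (Note the single slash after http: in these paths: consecutive slashes in a filesystem path collapse to one, so touch images/"$url" creates exactly the paths shown.)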
    

    (Version 2): I want the empty file created in the dir images. This can be done because all the images have unique names.

    So the files created should be:

    images/foo.jpg
    images/foo_001.jpg
    images/foo_003.jpg
    images/foo_005.jpg
    images/bar_000.jpg
    images/bar_002.jpg
    images/bar_004.jpg
    images/baz.jpg
    images/baz_001.jpg
    images/baz_005.jpg
    

    (Version 3): I want the empty file created in the dir images, named as in urls.txt. This can be done because only one of _001.jpg .. _005.jpg exists.

    images/foo.jpg
    images/bar.jpg
    images/baz.jpg
    
    The script below implements all three versions; keep the version you need and comment out the other two:

    #!/bin/bash

    do_url() {
      # Arguments, as supplied by the parallel command below:
      #   $1 = full URL to test (URL without extension + suffix)
      #   $2 = dir part of the URL
      #   $3 = basename of the URL (with suffix)
      #   $4 = original name from urls.txt
      url="$1"

      # Version 1:
      # If you want to keep the folder structure from the server (similar to wget -m):
      wget -q --method HEAD "$url" && mkdir -p images/"$2" && touch images/"$url"

      # Version 2:
      # If all the images have unique names and you want all images in a single dir:
      wget -q --method HEAD "$url" && touch images/"$3"

      # Version 3:
      # If all the images have unique names when _###.jpg is removed and you want all images in a single dir:
      wget -q --method HEAD "$url" && touch images/"$4"
    }
    export -f do_url
    
    parallel do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
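
    To see how the replacement strings map to do_url's arguments: :::: reads the URLs from urls.txt, ::: adds the suffixes, and GNU Parallel forms every combination of the two. {1.} is the URL without its extension, {1//} its dir part, {1/.} its basename without extension, {1/} its basename, and {2} the suffix. For one combination:

    # One combination: URL http://example.com/dira/foo.jpg, suffix _001.jpg
    {1.}{2}  => http://example.com/dira/foo_001.jpg    ($1: the URL to test)
    {1//}    => http://example.com/dira                ($2: dir part, Version 1)
    {1/.}{2} => foo_001.jpg                            ($3: basename, Version 2)
    {1/}     => foo.jpg                                ($4: name from urls.txt, Version 3)

    This is probably also why your edited script behaved inconsistently: there a single wget was handed all five _001.jpg .. _005.jpg URLs at once, and wget exits non-zero if any of them is missing, so the touch after && only ran when all five existed (and when it did run, the brace-expanded touch created all five dummy files regardless of which URLs actually existed). Here every URL is tested in a job of its own.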
    

    GNU Parallel takes a few ms per job. When your jobs are this short, that overhead will affect the timing. If none of your CPU cores are running at 100%, you can run more jobs in parallel (-j0 runs as many jobs as the system allows):

    parallel -j0 do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
    

    You can also "unroll" the loop. This saves five job overheads per URL:

    do_url() {
      url="$1"
      # Version 2:
      # If all the images have unique names and you want all images in a single dir
      wget -q --method HEAD "$url".jpg && touch images/"$url".jpg
      wget -q --method HEAD "$url"_001.jpg && touch images/"$url"_001.jpg
      wget -q --method HEAD "$url"_002.jpg && touch images/"$url"_002.jpg
      wget -q --method HEAD "$url"_003.jpg && touch images/"$url"_003.jpg
      wget -q --method HEAD "$url"_004.jpg && touch images/"$url"_004.jpg
      wget -q --method HEAD "$url"_005.jpg && touch images/"$url"_005.jpg
    }
    export -f do_url
    
    parallel -j0 do_url {.} :::: urls.txt
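
    If your wget does not support --method (it appeared around wget 1.15), the same check can be sketched with curl instead, assuming curl is installed: -I sends a HEAD request, -f makes curl exit non-zero on HTTP errors such as 404, and -s silences output:

    do_url() {
      url="$1"            # URL without its extension, as passed by {.}
      base=${url##*/}     # filename part of the URL
      for suffix in .jpg _{001..005}.jpg; do
        # touch the dummy file only if the HEAD request succeeds
        curl -sfI "$url$suffix" >/dev/null && touch "images/$base$suffix"
      done
    }
    export -f do_url

    parallel -j0 do_url {.} :::: urls.txt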
    

    Finally, if you need to run more than 250 jobs in parallel, there is a workaround: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround