Tags: bash, for-loop, curl, parallel-processing, xargs

Bash parallel curl requests from an array of URLs inside a for loop


Hi everyone!

I'm trying to make parallel curl requests from an array of URLs to speed up a bash script. After some research I found that there are several approaches: GNU parallel, xargs, curl's built-in --parallel option (starting from 7.68.0), and backgrounding with an ampersand. The best option for me would be to avoid GNU parallel, since it would have to be installed and I'm not allowed to do that.

Here is my initial script:

#!/bin/bash

external_links=(https://www.designernews.co/error https://www.awwwards.com/error1/ https://dribbble.com/error1 https://www.designernews.co https://www.awwwards.com https://dribbble.com)
invalid_links=()

for external_link in "${external_links[@]}"
  do
      # capture curl's stderr in the variable (2>&1) while discarding the body (> /dev/null)
      curlResult=$(curl -sSfL --max-time 60 --connect-timeout 30 --retry 3 -4 "$external_link" 2>&1 > /dev/null) && status=$? || status=$?
      if [ $status -ne 0 ] ; then
          if [[ $curlResult =~ (error: )([0-9]{3}) ]]; then
              error_code=${BASH_REMATCH[2]}   # just the 3-digit HTTP status code
              invalid_links+=("${error_code} ${external_link}")
              echo "${external_link}"
          fi
      fi
  done

echo "Found ${#invalid_links[@]} invalid links: "
printf '%s\n' "${invalid_links[@]}"

I tried changing the curl options and adding xargs and ampersands, but without success. All the examples I found either use GNU parallel or read the data from a file; none of them work with a variable containing an array of URLs (Running programs in parallel using xargs, cURL with variables and multiprocessing in shell). Could you please help me with this issue?


Solution

  • The main problem is that you can't modify your array directly from a sub-process. A possible work-around is to use a FIFO file for transmitting the results of the sub-processes to the main program.
    remark: As long as each message is shorter than PIPE_BUF bytes (see getconf PIPE_BUF /), the writes to the FIFO file are guaranteed to be atomic.
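
    To see why simply putting an ampersand after the curl command in the original loop doesn't work: the invalid_links+=... then runs in a background subshell, and the parent shell never sees the change. A minimal sketch of the issue (hypothetical array name):

    #!/bin/bash
    arr=()
    { arr+=(added); } &   # the brace group runs in a subshell
    wait
    echo "${#arr[@]}"     # prints 0 - the parent's array was never modified

    Hence the FIFO work-around: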

    #!/bin/bash
    
    # create a FIFO and keep it open for reading and writing on fd 3,
    # so that the background jobs can send their results back to the main shell
    tempdir=$(mktemp -d) &&
    mkfifo "$tempdir/fifo" &&
    exec 3<> "$tempdir/fifo" || exit 1
    
    external_links=(
        https://www.designernews.co/error
        https://www.awwwards.com/error1/
        https://dribbble.com/error1
        https://www.designernews.co
        https://www.awwwards.com
        https://dribbble.com
    )
    
    # launch one background curl per URL; each job writes "http_code/url\n"
    # to the FIFO with a single printf so that messages from concurrent jobs
    # cannot interleave
    for url in "${external_links[@]}"
    do
        {
            http_code=$(curl -4sLI -o /dev/null -w '%{http_code}' --max-time 60 --connect-timeout 30 --retry 2 "$url")
            printf '%s/%s\n' "$http_code" "$url"
        } >&3 &
    done
    
    invalid_links=()
    
    # read back exactly one "http_code/url" line per launched job;
    # IFS='/' assigns the part before the first '/' to http_code and the rest to url
    for (( i = ${#external_links[@]}; i > 0; i-- ))
    do
        IFS='/' read -u 3 -r http_code url
        (( 200 <= http_code && http_code <= 299 )) || invalid_links+=( "$http_code $url" )
    done
    
    echo "Found ${#invalid_links[@]} invalid links:"
    (( ${#invalid_links[@]} > 0 )) && printf '%s\n' "${invalid_links[@]}"
    

    remarks:

    • Here I consider a link valid when curl's reported HTTP code is in the 200-299 range.
    • When the web server doesn't exist or doesn't reply, curl's output is 000.
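
    For example, probing a host that cannot be reached (hypothetical host name) would show the 000 case:

    curl -4sLI -o /dev/null -w '%{http_code}\n' --connect-timeout 5 https://no-such-host.invalid
    # typically prints 000, since no HTTP response was received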

    UPDATE

    Limiting the number of parallel curl requests to 10.

    It's possible to add that logic in the shell itself (a sketch is shown after the xargs version below), but if you have BSD or GNU xargs then you can replace the whole for url in "${external_links[@]}"; do ...; done loop with:

    printf '%s\0' "${external_links[@]}" |
    xargs -0 -P 10 -n 1 sh -c '
        # emit "http_code/url" with a single printf (one atomic write per URL)
        http_code=$(
            curl -4sLI \
                 -o /dev/null \
                 -w "%{http_code}" \
                 --max-time 60 \
                 --connect-timeout 30 \
                 --retry 2 "$0"
        )
        printf "%s/%s\n" "$http_code" "$0"
    ' 1>&3
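
    For completeness, the limiting can also be done in the shell itself, without xargs. A rough sketch (assuming bash 4.3+ for wait -n, reusing fd 3 from above):

    max_jobs=10
    running=0

    for url in "${external_links[@]}"
    do
        {
            http_code=$(curl -4sLI -o /dev/null -w '%{http_code}' --max-time 60 --connect-timeout 30 --retry 2 "$url")
            printf '%s/%s\n' "$http_code" "$url"
        } >&3 &

        # once max_jobs requests are in flight, wait for any one of them to finish
        if (( ++running >= max_jobs ))
        then
            wait -n
            (( running-- ))
        fi
    done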
    

    Furthermore, if your curl is at least 7.75.0 then you should be able to replace the sh -c '...' with a single curl command (untested):

    printf '%s\0' "${external_links[@]}" |
    xargs -0 -P 10 -n 1 \
        curl -4sLI \
             -o /dev/null \
             -w "%{http_code}/%{url}\n" \
             --max-time 60 \
             --connect-timeout 30 \
             --retry 2 \
    1>&3
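
    Finally, since the question also mentions curl's built-in --parallel: with a curl recent enough for --parallel/--parallel-max (and 7.75.0+ for %{url}), a single curl invocation can run all the transfers concurrently by itself. A rough, untested sketch that feeds one url/output pair per link through a config file on stdin:

    for url in "${external_links[@]}"
    do
        printf 'url = "%s"\noutput = "/dev/null"\n' "$url"
    done |
    curl -4sLI --parallel --parallel-max 10 \
         --max-time 60 --connect-timeout 30 --retry 2 \
         -w '%{http_code}/%{url}\n' \
         --config - \
    1>&3

    With a single curl process producing all of the "http_code/url" lines, you could also read its output directly through a pipe instead of going through the FIFO.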