Search code examples
bashamazon-s3gnu-parallel

bash script with parallel execution


I am trying to use parallel in a bash script, to verify if s3 path exists or not and I am trying to verify multiple s3 paths, by counting the objects in the path. If the count of the object is zero it will continue to the next date in the for loop, with parallel it is not working as expected.

For Date range I provided in the for loop, we actually don't have those folders in the s3bucket, and in the function checkS3Path if s3 path doesnt exists, I am creating a 0KB file, but I dont see those 0KB files being created after script is executed. From the output of the script, I am seeing S3 Path Consists CSV Files, Proceeding to next step folder1:+2019-10-03, instead of S3 Path Doesnt Exists folder1:+2019-10-03. Please see the output below.

please let me what might be the issue.

Here is the sample code.

#!/bin/bash
#set -x
s3Bucket=testbucket
version=v20
Array=(folder1 folder2 folder3)

checkS3Path() {
  fldName=$1
  date=$2
  objectNum=$(aws s3 ls s3://${s3Bucket}/${version}/${fldName}/date=${date}/ | wc -l)
  echo $objectNum
  if [ "$objectNum" -eq  0 ]
  then
    echo "S3 Path Doesnt Exists ${fldName}:${date}" >> /app/${fldName}.log
    touch /home/ubuntu/${fldName}_${date}.txt
    continue
  else
    echo "S3 Path Consists csv Files, Proceeding to next step ${fldName}:${date}"
  fi
}

final() {
  fldName=$1
  date=$2
  checkS3Path $fldName $date
  function2 $fldName $date
  function3 $fldName $date
}

export -f final checkS3Path

for date in 2019-10-{01..03}
do
#  finalstep folder1 $date
  parallel --jobs 4 --eta finalstep ::: "${Array[@]}" ::: +"$date"
done

Here is the output I am seeing.

$ ./test.sh
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.


Computers / CPU cores / Max jobs to run
1:local / 4 / 4

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 0s Left: 14 AVG: 0.00s  local:4/0/100%/0.0s 202
S3 Path Consists CSV Files, Proceeding to next step folder1:+2019-10-01
ETA: 0s Left: 13 AVG: 0.00s  local:4/1/100%/2.0s 202
S3 Path Consists CSV Files, Proceeding to next step folder2:+2019-10-01
ETA: 0s Left: 12 AVG: 0.00s  local:4/2/100%/1.0s 202
S3 Path Consists CSV Files, Proceeding to next step folder3:+2019-10-01
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.


Computers / CPU cores / Max jobs to run
1:local / 4 / 4

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 0s Left: 14 AVG: 0.00s  local:4/0/100%/0.0s 202
S3 Path Consists CSV Files, Proceeding to next step folder1:+2019-10-02
ETA: 0s Left: 13 AVG: 0.00s  local:4/1/100%/0.0s 202
S3 Path Consists CSV Files, Proceeding to next step folder2:+2019-10-02
ETA: 6s Left: 12 AVG: 0.50s  local:4/2/100%/0.5s 202
S3 Path Consists CSV Files, Proceeding to next step folder3:+2019-10-02
ETA: 3s Left: 11 AVG: 0.33s  local:4/3/100%/0.3s 202
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.


Computers / CPU cores / Max jobs to run
1:local / 4 / 4

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 0s Left: 14 AVG: 0.00s  local:4/0/100%/0.0s 202
S3 Path Consists CSV Files, Proceeding to next step folder1:+2019-10-03
ETA: 0s Left: 13 AVG: 0.00s  local:4/1/100%/1.0s 202
S3 Path Consists CSV Files, Proceeding to next step folder2:+2019-10-03
ETA: 0s Left: 12 AVG: 0.00s  local:4/2/100%/0.5s 202
S3 Path Consists CSV Files, Proceeding to next step folder3:+2019-10-03
ETA: 0s Left: 11 AVG: 0.00s  local:4/3/100%/0.3s 202

$

Thanks


Solution

  • If checkS3Path works when run by hand, then you probably just need to:

    export s3Bucket=testbucket
    export version=v20
    

    Each GNU Parallel job runs in its own shell (started from Perl) which is the reason you need to export variables, if you want them to be visible to the job.

    Also look at env_parallel to do this automatically.