Autosys Job Queue


I'm trying to set up [1] an Autosys job configuration that has a "funnel" job-queue behavior, or, as I call it, a 'waterdrops' pattern: each job executes in sequence after a given time interval, and a local job failure does not cascade into failure of the sequence.

[1] (Ask for it to be set up, actually, as I do not control the Autosys machine.)


Constraints

  • I have an (arbitrary) number N of jobs (all executing on success of job A).
    • For this discussion, let's say three (B1, B2, B3).
    • Real production numbers might go upward of 100 jobs.
    • These jobs won't all be created at the same time, so adding a new job should be as painless as possible.
  • None of those should execute simultaneously.
    • This is not actually a direct problem for our machine.
    • But it has a side effect on a remote client machine: the jobs include file transfers, which a trigger on the client machine listens for, and the client doesn't handle simultaneous transfers well.
      • Adapting the client machine's behavior is, unfortunately, not possible.
  • Failure of one job is meaningless to the other jobs.
  • There should be a regular delay between jobs.
    • This is a soft requirement in that, our jobs being batch scripts, we can always append or prepend a sleep command.
    • I'd rather, however, have a more elegant solution, especially if the delay is centralised: a parameter that could be set to greater values, should the need arise.

State of my research

Legend

A(s) : Success status of job
A(d) : Done status of job

Solution 1 : Unfailing sequence

This is the current "we should pick this solution" solution.

A(s) --(delay D)--> B1(d) --(delay D)--> B2(d) --(delay D)--> B3 ...

Pros :

  • Less bookkeeping than Solution 2

Cons :

  • Bookkeeping of the (current) tail job
  • The sequence doesn't survive a job being put ON HOLD (ON ICE is fine).
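
To make the tail-job bookkeeping concrete, here is a minimal Python sketch that regenerates the dependency chain from an ordered list of job names, so that appending a job only means appending a name. It assumes the JIL `update_job` and `condition` attributes with the `s()` and `d()` condition functions from the legend above; the function name `chain_conditions` is hypothetical, and the delay D is deliberately left out because how it would be expressed depends on the installation.

```python
from typing import List


def chain_conditions(trigger_job: str, jobs: List[str]) -> str:
    """Build JIL-style text chaining each job to the 'done' status of the previous one.

    The first job waits on success of trigger_job; every later job waits on
    d(previous job), so a failed job does not break the chain.
    """
    lines = []
    previous = f"s({trigger_job})"  # first link: success of A
    for job in jobs:
        lines.append(f"update_job: {job}")
        lines.append(f"condition: {previous}")
        lines.append("")  # blank line between job definitions
        previous = f"d({job})"  # next job waits for this one to be done
    return "\n".join(lines)


if __name__ == "__main__":
    print(chain_conditions("A", ["B1", "B2", "B3"]))
```

With this approach the "current tail job" is simply the last element of the list, at the cost of regenerating the definitions whenever the list changes.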

Solution 2 : Stairway parallelism

A(s) ==(delay D)==> B1
A(s) ==(delay D x2)==> B2
A(s) ==(delay D x3)==> B3
...

Pros :

  • Jobs can be put ON HOLD without consequence.

Cons :

  • Bookkeeping to know "who is when" (and what the next delay to implement is)
  • N jobs executed at the same time
  • Creates an underlying race condition, plus a risk of overlapping job executions, especially if small delays accumulate
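
The overlap risk can be checked with a little arithmetic. The sketch below is a simplified model, not Autosys behavior: it assumes each job has a known duration, that the k-th job of the stairway starts exactly k x D after A succeeds, and that in the chained variant each job starts D after the previous one ends. All function names are hypothetical.

```python
from typing import List, Tuple

Schedule = List[Tuple[float, float]]  # (start, end) per job, measured from A's success


def stairway_schedule(durations: List[float], delay: float) -> Schedule:
    """Solution 2: job k (1-based) starts at k * delay, whatever the other jobs do."""
    return [((k + 1) * delay, (k + 1) * delay + d) for k, d in enumerate(durations)]


def chained_schedule(durations: List[float], delay: float) -> Schedule:
    """Solution 1: each job starts `delay` after the previous job has finished."""
    schedule = []
    previous_end = 0.0
    for d in durations:
        start = previous_end + delay
        schedule.append((start, start + d))
        previous_end = start + d
    return schedule


def has_overlap(schedule: Schedule) -> bool:
    """True if some job is still running when the next one starts.

    Starts are increasing, so checking consecutive pairs is enough to
    detect whether any overlap exists.
    """
    return any(schedule[i][1] > schedule[i + 1][0] for i in range(len(schedule) - 1))


if __name__ == "__main__":
    durations = [20, 45, 10]  # seconds, made-up values
    print(has_overlap(stairway_schedule(durations, 30)))  # second job outlasts the 30 s step
    print(has_overlap(chained_schedule(durations, 30)))
```

In this model the stairway overlaps as soon as any job runs longer than D, while the chain never does.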

Solution 3 : The Miracle Box ?

I have read a bit about Job Boxes, but the specific details elude me.

           --------------
A(s) ====> | B1, B2, B3 |
           --------------

  • Can we limit the number of jobs of a box that execute concurrently (i.e. a box-local max_load, if I understand that parameter)?

Pros :

  • Adding jobs would be painless
  • Little to no bookkeeping (only the box name, to add new jobs, and it's constant)
  • Jobs can be put ON HOLD without consequence (unless I'm mistaken)

Cons :

  • I'm half-convinced it can't be done (but that's why I'm asking you :) )
  • ... any other problem I have failed to foresee

My questions to SO

  1. Is Solution 3 possible, and if so, what are the specific commands and parameters for implementing it?
  2. Otherwise, am I correct in favoring Solution 1 over Solution 2 [2]?
  3. An alternative solution that fits the constraints is of course more than welcome!

Thanks in advance,
Best regards

PS: By the way, is all of this a giant race-condition manager for the remote machine's failing behavior?
Yes, it is.

[2] I'm aware this skirts the "subjective" part of the question rejection rules, but I'm asking about the correctness of the solution(s) with respect to my (arguably) objective constraints.


Solution

  • I would suggest the following:

    1. Put all the jobs (B1, B2, B3) in a box job B.
    2. Create another job (say M1) that runs on success of A. This job calls a shell/perl script (say forcejobs.sh).
    3. The script gets the list of all the jobs in B and loops over them, force-starting the jobs one by one and sleeping for the delay period after each start.

      So the outline of the script would be

        get all the jobs in B
        for each job in B
             force start the job
             sleep for delay interval
      
    4. At the end of the loop, once all jobs have been started, the script can keep polling the status of the jobs. Once all jobs are SU/FA or whatever, it can end, send the result to you/stdout, and finish the job M1.
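
The force-start loop of step 3 could be sketched as follows, here in Python rather than shell. It assumes the Autosys `sendevent -E FORCE_STARTJOB -J <job>` command is available on the machine running M1; the function name `force_start_in_sequence` and its `run`/`sleep` parameters are hypothetical and exist only so the loop can be exercised without Autosys. Listing the jobs of box B and the status polling of step 4 are left out, since parsing that output depends on the installation.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def force_start_in_sequence(
    jobs: Iterable[str],
    delay_seconds: float,
    run: Callable[[List[str]], int] = subprocess.call,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Force-start each job in order, pausing delay_seconds between starts.

    Returns the names of the jobs whose sendevent call returned a non-zero
    exit code. A failed start does not stop the sequence.
    """
    job_list = list(jobs)
    failed = []
    for index, job in enumerate(job_list):
        # Assumed Autosys CLI call; adjust to the local installation.
        exit_code = run(["sendevent", "-E", "FORCE_STARTJOB", "-J", job])
        if exit_code != 0:
            failed.append(job)
        if index < len(job_list) - 1:
            sleep(delay_seconds)  # no need to wait after the last job
    return failed


if __name__ == "__main__":
    # Job names and delay are placeholders taken from the question.
    not_started = force_start_in_sequence(["B1", "B2", "B3"], delay_seconds=60)
    print("failed to start:", not_started)
```

Note that the loop waits a fixed delay between starts rather than for each job to finish, so the delay must be longer than the longest job if the jobs must never overlap.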