Search code examples
postgresqlcronplpgsqlpg-cron

How to run Postgres pg_cron Job AFTER another Job?


I running some automated tasks on my postgres database at night using the pg_cron extension. I am moving certain old records to archive database tables. I am running 5 Stored Procedures concurrently on 5 different background workers, so they all start at the same time and run on different workers (I am assuming this is similar to running different Tasks on different Threads in Java). These 5 Stored Procedures are independent (moving records to archive tables), so they can run at the same time. I schedule them each using a command like

cron.schedule (myJob1,
    '* * * * *',
    'call my_stored_proc_1()'
);

cron.schedule (myJob2,
    '* * * * *',
    'call my_stored_proc_2()'
);

.
..
...

cron.schedule (myJob5,
    '* * * * *',
    'call my_stored_proc_5()'
);

NOW, I have some MORE dependent Store Procedures that I want to run. But they need to run AFTER these 5 Jobs finish/complete, because they are doing some DELETE... sql operations.

How can I have this second Stored Procedure (the one doing the DELETE queries) Job run AFTER my first 5 Stored Procedures Jobs when they are DONE? I don't want to set a CRON expression for the second Stored Procedure doing the DELETES, because I don't know what time the first 5 Stored Procs are even going to finish...

Below I included a little schematic of how the Jobs are currently triggered and how I want it to work (if possible): enter image description here


Solution

  • Preface: how I understand problem

    I hope that I understand the problem described by OP.

    If I was wrong then it makes everything below invalid.

    I suppose that it's about periodic night tasks heavy in CPU and/or IO.

    E.g:

    • there are tasks A-C for archiving data
    • maybe task D-E for rebuilding aggregates / refreshing mat views
    • and finally task F that runs reindexing/analyze on whole DB

    So it makes sense to run task F only after tasks A-E are finished.

    Every task is needed to be run just once in a period of time:

    • once in a day or hour or week or only during weekends in a night time
    • it's better not to run in a time when server is under load

    Does it fits with OP requirement - IDK.

    For the sake of simplicity let's presume that each task runs only once in a night. It's easy to extend for other periods/requirements.

    Data-driven approach

    1. Add log table

    E.g.

    CREATE TABLE job_log (
      log_id bigint,
      job_name text,
      log_date timestamptz
    ) 
    

    Tasks A-E

    On start

    For each job function do check:

    IF  EXISTS(
      SELECT 1 FROM job_log 
        WHERE
          job_name = 'TaskA' # TaskB-TaskE for each functiont
          AND log_date::DATE = NOW()::DATE # check that function already executed  this night
    ) OR  EXISTS(
       SELECT 1 FROM pg_stat_activity 
         WHERE 
           query like 'SELECT * FROM jobA_function();'  # check that job not executing right now
    ) THEN RETURN;
    END IF;
    

    It's possible that other conditions could be added: look for amount of connections, existence of locks and so on.

    This way it will be guaranteed that function will not be executed more frequently than needed.

    On finish

    INSERT INTO job_log
    SELECT
       (SELECT MAX(log_id) FROM job_log) + 1 # or use sequences/other autoincrements
      ,'TaskA'
      ,NOW()
    

    Cronjob schedule

    The meaning of it becames different.

    Now it's: "try to initiate execution of task".

    It's safe to schedule it for every hour between a chosen period or even more frequently.

    Cronjob cannot know if the server is under load or not, are there locks on a table or maybe somebody started execution of task manually.

    Job function could be more smart in that.

    Task F

    Same as above but check on start looks for completion of other tasks.

    E.g.

    IF NOT EXISTS(
      SELECT 1 FROM job_log 
         WHERE 
           job_name = 'TaskA'
           AND log_date::DATE = NOW()::DATE
    ) OR NOT EXISTS(  
      SELECT 1 FROM job_log 
         WHERE 
           job_name = 'TaskB'  
           AND log_date::DATE = NOW()::DATE
    )
    ....  # checks for completions of other tasks
    OR EXISTS(
      SELECT 1 FROM job_log 
        WHERE
          job_name = 'TaskF' # TaskB-TaskE for each functiont
          AND log_date::DATE = NOW()::DATE # check that function already executed  this night
    ) OR  EXISTS(
       SELECT 1 FROM pg_stat_activity 
         WHERE 
           query like 'SELECT * FROM jobF_function();'  # check that job not executing right now
    ) THEN RETURN;
    

    On completion

    Write to job_log the same as other functions.

    UPDATE. Cronjob schedule

    Create multiple schedule in cronjob.

    E.g.

    Let's say tasks A-E will run approximately 10-15 minutes.

    And it's possible that one or two of them could work for 30-45-60 minutes.

    Create a schedule for task F to attempt start every 5 minutes.

    How that will work:

    • attempt 1: task A finished, other still working -> exit
    • attempt 2: task A-C finished -> exit
    • attempt 3: tasks A-E finished -> start task F
    • attempt 4: tasks A-E finished but in pg_stat_activity there is an executing task F -> exit
    • attempt 5: tasks A-E finished, pg_stat_activity is empty but in logs we see that task F already executed -> no need to work -> exit
    • ... all other attempts will be the same till next night

    Summary

    It's easy extend this approach for any requirements:

    • another periodicity
    • or make it unperiodic at all. E.g. make a table with trigger and start execution on change
    • dependencies of any depth and/or "fuzzy" dependencies
    • ... literally everything

    Conception remains the same:

    • cronjob schedule means "try to run"
    • decision to run or not is data-driven

    I would be glad to hear criticism of any kind - who knows maybe I'm overlooking something.