Search code examples
cronmonitoringserverless

How do I easily extend the scope of my CRON job, monitor and retry if a part/all of the job fails?


I run a website that allows users to book ferry tickets. We keep local inventory of ferry schedules in our DB. The way it works is that we have code that fetches schedule data from ferry operators' systems via API calls and the fetched data is then parsed, the code then does some checks with our local inventory in our DB i.e it reads data from our schedules DB and finally updates (writes to) the DB with updated/new schedules that have been fetched and parsed.

Now, this happens numerous times during the day. We fetch schedules in slots (what I'm referring to as the CRON) at recurring intervals during the day (as schedules on the operator's end can change multiple times during the day) in the following manner:

Day 1-5 (1 hr time interval)

Day 6-10 (1 hr time interval)

Day 11-15 (1 hr time interval)

Day 16-20 (2 hrs time interval)

Day 20-30 (2 hrs time interval)

Day 31-40 (4 hrs time interval)

Day 41-50 (4 hrs time interval)

Day 51-60 (4 hrs time interval)

Day 61-70 (8 hrs time interval)

Day 71-80 (8 hrs time interval)

Day 81-90 (8 hrs time interval)

Day 91-100 (8 hrs time interval)

Day 101-110 (8 hrs time interval)

Day 111-120 (8 hrs time interval)

Day 121-130 (8 hrs time interval)

Day 131-140 (8 hrs time interval)

Day 141-150 (8 hrs time interval)

Day 151-160 (8 hrs time interval)

Day 161-170 (8 hrs time interval)

Day 171-180 (8 hrs time interval)

I want to achieve the following:

  1. I want to extend this CRON to fetch data for 365 days (this data i.e 180-365 days isn't updated as frequently, what would be a good interval to run this at?).
  2. The CRON's execution takes up quite a bit of resources (RAM and CPU) on our server (It's an 8GB, 2 vCPU DigitalOcean droplet i.e a shared server hosts the site and the DB).
  3. At this moment, I am unable to verify if the CRON successfully works i.e I'm unsure if all the updates to the DB that are to be made after fetching fresh schedules data from the operators is actually being updated in our DB to reflect the daily schedules. And if the fetching-schedules part or the updating-the-DB part of the CRON fails, I want to know which part failed, when it failed and I want the server to retry it. So some kind of monitoring and retrying basically.

Other Constraints:

  1. It may be important to note that the operator APIs tend to be a bit unreliable and give out slow responses from time to time.
  2. Quite a lot of data needs to be fetched, parsed and written to the DB (due to multiple schedules per day with multiple parameters changing)

Ultimately, my goal is to maintain a local copy of the operators' schedules (as close to realtime as possible) without having the CRON take up too many resources. I also want the CRON to ideally not fail and if it does fail, I want to be notified about it and I want auto retries for a certain number of times. And I want the CRON to be separate from my server and be modular in a sense.

I'm open to both custom solutions or ideally a 3rd party service that does this well but I'm optimizing for simplicity and cost here.

I was thinking of the following approaches (Please correct me if I'm wrong):

  1. Splitting the code into 2 parts:
    1. The part that fetches schedules from the operator APIs and stores it in one table, say table1.

    2. The part that parses the data in table1 and reads and writes to my main schedules table.

  2. Get another digitalocean droplet (shared server) that executes the above 2 parts

Offloading this service off my website's main server should free up resources solving that resource usage problem. However, I'm still left with the problem of not being able to verify if all my updates are actually taking place i.e if schedules are only getting partially updated or fully updated or knowing when and where exactly in my code are failures occurring.


Solution

  • Decided on the following approach:

    Split the CRON service into 2 parts:

    1. Part 1 fetches schedules and stores it in a table, let's call it fetched schedules in a managed DB on DigitalOcean.
    2. Part 2 reads from fetched_schedules, reads another table, say live_schedules from the production DB, parses this data, performs some logical operations and writes updated schedules to the live_schedules table.

    Using serverless functions:

    1. Part 1 and Part 2 will essentially be 2 vanilla-PHP serverless functions on DigitalOcean.
    2. As an example: First, day 1-5 schedules are fetched and written to fetched_schedules followed by simultaneous execution of fetching of Day 6-10 schedules and Part2's 1-5 schedule parsing is done so that updates can be finally written to live_schedules

    Why this approach?

    1. If Part 1 fails, I have logs provided by DO on what slot that I was fetching schedules for, failed.
    2. If Part 2 fails, I have logs provided by DO on what slot that I was parsing and writing data to live_schedules for, failed.
    3. I can share these logs (or they can be auto-shared) with my non-technical team for them to manually trigger a Part1 slot specifically as a retry mechanism.
    4. 2 serverless functions running on low-powered servers from DO on say 128MB / 256MB of RAM even with reasonably high request processing time seem to be quite affordable.