Search code examples
wordpresscronrating

Run a DB-intensive query/calculation asynchronously


This question relates to WordPress's wp-cron function but is general enough to apply to any DB-intensive calculation.

I'm creating a site theme that needs to calculate a time-decaying rating for all content in the system at regular intervals. This rating determines the order of posts on the homepage, which is paged to allow visitors to potentially view all content. This rating value needs to be calculated frequently to make sure the site has fresh content listed in the proper order.

The rating calculation is not heavy but the rating needs to be calculated for, potentially, 1,000s of items and doing that hourly via wp-cron will start to cause problems for sites with lots of content. Ignoring the impact on page load (wp-cron processes requests on page loads once a certain interval has been reached), at some point the script will reach a time limit. Setting up the site to use "plain ol' cron" will solve the page loading issue but not the timeout one.

Assuming that I have no control over the sites that this will run on, what's the best way to handle this rating calculation on a regular basis? A few things that came to mind:

  • Only calculate the rating for the most recent 1,000 posts, assuming that the rest won't be seen much. I don't like the idea of ignoring all old content, though.
  • Calculate the first, say, 100 or so, then only calculate the rating for older groups if those pages are loaded. This might be hard to get right, though, and lead to incorrect listing and ratings (which isn't a huge problem for older content but something I'd like to avoid)
  • Batch process 100 or so at regular intervals, keeping track of the last one processed. This would cycle through the whole body of content eventually.

Any other ideas? Thanks in advance!


Solution

  • Depending on the host, you're in for a potentially sticky situation. Let me outline a couple of ideal cases and you can pick/choose where you need to.

    Option 1

    Mirror the database first and use a secondary app (WordPress or otherwise) to do the calculations asynchronously against that DB mirror. When they're done, they can update a static file in the project root, write data to a shared Memcached instance, trigger a POST to WordPress' admin_post endpoint to write some internal state, whatever.

    The idea here is that you're removing your active site from the equation. The last thing you want to do is have a costly cron job lock the live site's database or cause queries to slow down as it does its indexing.

    Option 2

    Offload the calculation entirely to a separate application. Tracking ratings in real time with WordPress is a poor idea as it bypasses page caching and triggers an uncachable request every time a new rating comes in. Pushing this off to a second server means your WordPress site is super fast, and it also means you can have the second server do the calculations for you in the first place.

    If you're already using something like Elastic Search on the site, you can add ratings as an added indexing facet. Then just update posts as ratings change, and use the ES API to query most popular posts later.

    Alternatively, you can use a hosted service like Keen IO to record and aggregate ratings.

    Option 3

    Still use cron, but don't schedule it as a cron job in WordPress. Instead, write a WP CLI routine that does the reindexing for you. Then, schedule real cron jobs to process the job.

    This has the advantage of using PHP's command line version, which can be configured to skip the timeouts and memory limits imposed on the FPM/CGI/whatever version used to serve the site. It also means you don't have to wait for site traffic to trigger the job - and a long-running job won't block other cron events within WordPress from firing.

    If using this process, I would set the job to run hourly and, each hour, run a batch of 1/24th of the total posts in the database. You can keep track of offsets or even processed post IDs in the database, the point is just that you're silently re-indexing posts throughout the day.