I recently had a task to iterate over a big table (~40 million records) in PostgreSQL using Propel, and I ran into performance issues, both with the memory limit and with execution speed. My script had been running for 22(!) hours.
The task was to retrieve records matching certain criteria (not active for the last 6 months) and archive them (move them to another table) together with all related entities from other tables.
The primary table my script works on has several columns: id, device_id, application_id, last_init_date, and a few others that are not significant here. The table stores information about the applications installed on a device and their last activity dates. There may be several records with the same device_id and different application_id values. Here is a sample from the table:
 id | device_id | application_id |    last_init_date
----+-----------+----------------+---------------------
  1 |         1 |              1 | 2013-09-24 17:09:01
  2 |         1 |              2 | 2013-09-19 20:36:23
  3 |         1 |              3 | 2014-02-11 00:00:00
  4 |         2 |              4 | 2013-09-29 20:12:54
  5 |         3 |              5 | 2013-08-31 19:41:05
So, a device is considered old enough to be archived if the maximum last_init_date for that device_id in this table is older than 6 months. Here is the query:
SELECT device_id
FROM device_applications
GROUP BY device_id
HAVING MAX(last_init_date) < '2014-06-16 08:00:00'
In Propel it looks like:
\DeviceApplicationsQuery::create()
->select('DeviceId')
->groupByDeviceId()
->having('MAX(device_applications.LAST_INIT_DATE) < ?', $date->format('Y-m-d H:i:s'))
->find();
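Here $date is the cutoff point, six months before the moment the script runs. Its construction is not part of the snippet above; a minimal sketch of how it might be built:

// Devices whose latest activity is older than this cutoff are candidates for archiving.
$date = new \DateTime('-6 months');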
The resulting set, as you understand, is too big to fit in memory, so I have to split it somehow into chunks.
The question is: what is the best strategy to choose in this situation to decrease memory consumption and to speed up the script? In my answer I'll show you what I've found so far.
I know three strategies for traversing a big table. The first one is the usual LIMIT/OFFSET pagination. The problem with this approach is that the database actually examines the records that you want to skip with OFFSET. Here is a quote from the documentation:
The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large OFFSET might be inefficient.
Here is a simple example (not my initial query):
explain (analyze)
SELECT *
FROM device_applications
ORDER BY device_id
LIMIT 100
OFFSET 300;
Execution plan:
Limit (cost=37.93..50.57 rows=100 width=264) (actual time=0.630..0.835 rows=100 loops=1)
-> Index Scan using device_applications_device_id_application_id_unique on device_applications (cost=0.00..5315569.97 rows=42043256 width=264) (actual time=0.036..0.806 rows=400 loops=1)
Total runtime: 0.873 ms
Pay special attention to the actual results in the Index Scan section. They show that PostgreSQL worked with 400 records, which is the offset (300) plus the limit (100). So this approach is quite inefficient, especially considering the complexity of the initial query.
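For completeness, applied to the original query the limit/offset strategy would look roughly like this in Propel (a sketch, not my actual code; the page size of 1000 is arbitrary):

// Naive limit/offset pagination: each iteration makes the server skip $offset
// grouped rows before returning the next page, so it gets slower as $offset grows.
$limit  = 1000;
$offset = 0;

do {
    $deviceIds = \DeviceApplicationsQuery::create()
        ->select('DeviceId')
        ->groupByDeviceId()
        ->having('MAX(device_applications.LAST_INIT_DATE) < ?', $date->format('Y-m-d H:i:s'))
        ->orderByDeviceId() // a stable order is needed for paging to make sense
        ->limit($limit)
        ->offset($offset)
        ->find();

    foreach ($deviceIds as $deviceId) {
        // archive the device...
    }

    $offset += $limit;
} while (count($deviceIds) === $limit);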
We can avoid the limitations of the limit/offset approach by making the query work with ranges of the table, sliced by some column.
To clarify: imagine you have a table with 100 records. You can divide it into five ranges of 20 records each (0 - 20, 20 - 40, 40 - 60, 60 - 80, 80 - 100) and then work with these smaller subsets. In my case the column we can range by is device_id. The query looks like this:
SELECT device_id
FROM device_applications
WHERE device_id >= 1 AND device_id < 1000
GROUP BY device_id
HAVING MAX(last_init_date) < '2014-06-16 08:00:00';
It restricts the scan to the given device_id range, groups the records by device_id and applies the condition on last_init_date. Of course, it may happen (and in most cases will) that a particular range contains no records matching the condition. So the problem with this approach is that you end up scanning the whole table, even if the records you actually want make up just 5% of it.
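Driving this from PHP could look roughly like the sketch below (the step size and the hardcoded upper bound are assumptions for illustration; filterByDeviceId() with 'min'/'max' produces inclusive bounds):

// Assumed upper bound for device_id; in practice it can be fetched with
// a simple "SELECT MAX(device_id) FROM device_applications" query.
$maxId = 5000000;
$step  = 1000;

// Walk the device_id key space in fixed-size slices; each query only touches one slice.
for ($start = 1; $start <= $maxId; $start += $step) {
    $deviceIds = \DeviceApplicationsQuery::create()
        ->select('DeviceId')
        ->filterByDeviceId(array('min' => $start, 'max' => $start + $step - 1)) // inclusive bounds
        ->groupByDeviceId()
        ->having('MAX(device_applications.LAST_INIT_DATE) < ?', $date->format('Y-m-d H:i:s'))
        ->find();

    foreach ($deviceIds as $deviceId) {
        // archive the device and its related records...
    }
}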
What we need is a cursor. Cursors allow iterating over a result set without fetching all the data at once. In PHP you make use of cursors when you iterate over a PDOStatement. A simple example:
$stmt = $dbh->prepare("SELECT * FROM table");
$stmt->execute();
// Iterate over statement using a cursor
foreach ($stmt as $row) {
    // Do something
}
In Propel you can make use of this PDO feature with the PropelOnDemandFormatter class. So, the final code:
$devApps = \DeviceApplicationsQuery::create()
->setFormatter('\PropelOnDemandFormatter')
->select('DeviceId')
->groupByDeviceId()
->having('MAX(device_applications.LAST_INIT_DATE) < ?', $date->format('Y-m-d H:i:s'))
->find();
/** @var \DeviceApplications $devApp */
foreach ($devApps as $devApp) {
    // Do something
}
Here the call to find() will not fetch the data, but will instead create a collection that instantiates objects on demand.
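To tie it together, the archiving loop itself could be organized roughly like this (a sketch only; archiveDevice() is a hypothetical helper that moves a device and its related records into the archive tables, and the batch size is arbitrary):

$con = \Propel::getConnection(\DeviceApplicationsPeer::DATABASE_NAME);

$batchSize = 1000;
$processed = 0;

$con->beginTransaction();
/** @var \DeviceApplications $devApp */
foreach ($devApps as $devApp) {
    // Hypothetical helper: moves this device's rows and related entities
    // into the archive tables.
    archiveDevice($devApp->getDeviceId(), $con);

    // Commit in batches to keep transactions short and memory usage bounded.
    if (++$processed % $batchSize === 0) {
        $con->commit();
        $con->beginTransaction();
    }
}
$con->commit();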