Tags: sql, postgresql, database-design, indexing, postgresql-performance

How to store and query version of same document in PostgreSQL?


I am storing versions of a document in PostgreSQL 9.4. Every time a user creates a new version, a new row is inserted so that I can track all changes over time. Rows that belong to the same document share a reference_id. Some rows get approved, while others remain drafts. Each row also has a viewable_at time.

id | reference_id | approved | viewable_at         | created_at | content
1  | 1            | true     | 2015-07-15 00:00:00 | 2015-07-13 | Hello
2  | 1            | true     | 2015-07-15 11:00:00 | 2015-07-14 | Guten Tag
3  | 1            | false    | 2015-07-15 17:00:00 | 2015-07-15 | Grüß Gott

The most frequent query is to get the latest row per reference_id where approved is true and viewable_at is at or before the current time. (In this case, row id 2 would be included in the results.)

So far, this is the best query I've come up with that doesn't require me to add additional columns:

SELECT DISTINCT ON (reference_id) reference_id, id, approved, viewable_at, content 
FROM documents 
WHERE approved = true AND viewable_at <= '2015-07-15 13:00:00' 
ORDER BY reference_id, created_at DESC;

I have an index on reference_id and a multi-column index on approved and viewable_at.
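
For reference, those indexes look roughly like this (the names here are illustrative):

CREATE INDEX documents_reference_id_idx ON documents (reference_id);
CREATE INDEX documents_approved_viewable_at_idx ON documents (approved, viewable_at);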

At only 15,000 rows it still averages a few hundred milliseconds (140–200 ms) on my local machine. I suspect that the DISTINCT or the ordering may be slowing it down.

What is the most efficient way to store this information so that SELECT queries are the most performant?

Result of EXPLAIN (BUFFERS, ANALYZE):

                                                              QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
Unique  (cost=6668.86..6730.36 rows=144 width=541) (actual time=89.862..99.613 rows=145 loops=1)
  Buffers: shared hit=2651, temp read=938 written=938
  ->  Sort  (cost=6668.86..6699.61 rows=12300 width=541) (actual time=89.861..97.796 rows=13184 loops=1)
        Sort Key: reference_id, created_at
        Sort Method: external merge  Disk: 7488kB
        Buffers: shared hit=2651, temp read=938 written=938
        ->  Seq Scan on documents  (cost=0.00..2847.80 rows=12300 width=541) (actual time=0.049..40.579 rows=13184 loops=1)
              Filter: (approved AND (viewable_at < '2015-07-20 06:46:55.222798'::timestamp without time zone))
              Rows Removed by Filter: 2560
              Buffers: shared hit=2651
Planning time: 0.218 ms
Execution time: 178.583 ms
(12 rows)

Document Usage Notes:

The documents are manually edited and we're not yet autosaving the documents every X seconds or anything, so the volume will be reasonably low. At this point, there is an average of 7 versions per reference_id, of which an average of only 2 (~30%) are approved.

On the min and max side, the vast majority of documents will have 1 or 2 versions and it seems unlikely that any document would have more than 30 or 40. There is a garbage collection process to clean out unapproved versions older than a week, so the total number of versions should stay pretty low.

For retrieval and practical usage, I could use LIMIT / OFFSET on the queries, but in my tests that doesn't make a huge difference. Ideally this is a base query that populates a view or something, so that I can run additional queries on top of these results. I'm not entirely sure how that would affect the resulting performance and am open to suggestions. My impression is that if I can make this storage / query as simple and fast as possible, then all other queries that start from this point could be improved, but it's likely that I'm wrong and that each query needs more independent thought.


Solution

  • @Craig already covers most options to make this query faster. More work_mem for the session is probably the most effective item.
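
    The plan above spills the sort to disk (Sort Method: external merge  Disk: 7488kB), so anything comfortably above that should keep the sort in memory. A minimal sketch; the exact value is an assumption to tune:

    SET work_mem = '16MB';  -- affects the current session only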

    Since:

    There is a garbage collection process to clean out unapproved versions older than a week

    a partial index excluding unapproved versions won't save much. If you create an index at all, though, you would still exclude those irrelevant rows with a WHERE approved clause.
    Since you seem to have very few versions per reference_id:

    the vast majority of documents will have 1 or 2 versions

    You already have the best query technique with DISTINCT ON.

    With a growing number of versions, other techniques would be increasingly superior.

    The only slightly unconventional element in your query is that the predicate is on viewable_at, but you then take the row with the latest created_at, which is why your index would actually be:

    (reference_id, viewable_at ASC, created_at DESC) WHERE (approved)

    This assumes all involved columns are defined NOT NULL. The alternating sort order between viewable_at and created_at is important. Then again, with so few rows per reference_id, I don't expect any index to be of much use: the whole table has to be read anyway, and a sequential scan will be about as fast. The added maintenance cost of the index may even outweigh its benefit.
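
    Written out as DDL, that index would look like this (the index name is my own invention):

    CREATE INDEX documents_ref_viewable_created_idx
    ON     documents (reference_id, viewable_at, created_at DESC)
    WHERE  approved;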

    However, since:

    Ideally this is a base query that populates a view or something so that I can do additional queries on top of these results

    I have one more suggestion: create a MATERIALIZED VIEW from your query, giving you a snapshot of your project for the given point in time. If disk space is not an issue and snapshots might be reused, you might even collect a couple of those to stick around:

    CREATE MATERIALIZED VIEW doc_20150715_1300 AS
    SELECT DISTINCT ON (reference_id)
           reference_id, id, approved, viewable_at, content 
    FROM   documents 
    WHERE  approved  -- simpler expression for boolean column
    AND    viewable_at <= '2015-07-15 13:00:00' 
    ORDER  BY reference_id, created_at DESC;
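
    If the underlying data for that point in time can still change, the snapshot can be brought up to date by re-running its stored query:

    REFRESH MATERIALIZED VIEW doc_20150715_1300;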
    

    Or, if all additional queries happen in the same session, use a temp table instead (which dies at the end of the session automatically):

    CREATE TEMP TABLE doc_20150715_1300 AS ...;
    
    ANALYZE doc_20150715_1300;
    

    Be sure to run ANALYZE on the temp table (and on the MV, too, if you run queries immediately after creating it): temporary tables are not covered by autovacuum, so statistics are not gathered automatically.

    Either way, it may pay to create one or more indexes on the snapshots supporting subsequent queries. Depends on data and queries.
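
    For example, if subsequent queries fetch single documents by reference_id (an assumption about your workload), an index on the snapshot could pay off:

    CREATE INDEX doc_20150715_1300_ref_idx ON doc_20150715_1300 (reference_id);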

    Note that the current version 1.20.0 of pgAdmin does not display indexes for MVs. That has already been fixed and is due to be released with the next version.