sql postgresql datetime inner-join greatest-n-per-group

How can I query the latest version of a page in SQL, when that information is stored in a secondary table?


Consider that I'm writing a wiki¹. I may have one table that contains a row for each wiki page, and another that contains each version of that page, with a foreign key to the page that the version corresponds to. A user may request to view a list of every page, including the title of each page (which is stored in the versions table, since the title can be updated and thus should be tracked with versions).

I could first do a query to get a list of wiki pages, and then do a separate query to get the title of each page, but this runs many more queries than I need (one per page), and is thus less performant due to server round trips and some (very minor) blocking in the SQL library.

Instead, I'd rather do something like a JOIN between the wiki pages table and the versions table, but then I'll get a separate row in the result for each version, transferring and preparing a lot more data than I need. In my query to view a single page's contents, I just use ORDER BY timestamp DESC LIMIT 1, which works great there, but it won't work as-is for the list case since I need more than one row. Can I make the ORDER BY and LIMIT apply separately to each set of rows that share a page id?

My next idea is to try something with subqueries, which is all that my research points to: essentially my first option, but written so that Postgres' optimizer can see the entire operation at once and hopefully optimize it better than many separate queries, while also avoiding the extra round trips and blocking. However, when I looked at Postgres' list of available subquery options, I was unable to figure out how to use any of them to solve this problem.

Lastly, I could just store the title (and other per-version data that I need in this query) in the main table, but this is duplication of data and thus a bad practice. Nonetheless, it seems like the least evil that I can figure out at present; hence, the question: How can I query the data that I need, to produce a list of wiki pages including the latest per-version data in a performant manner and without duplicating data?

1: My project isn't a wiki, but as the details of it are private for now, I need to give a slightly contrived example.


Solution

  • You are describing a top-1-per-group problem. Without seeing the actual table structures this is rather theoretical, but the logic could be implemented with distinct on in Postgres. That would look something like this:

    select distinct on (p.page_id) p.*, pv.title
    from pages p
    inner join page_versions pv on pv.page_id = p.page_id
    order by p.page_id, pv.timestamp desc
    

    Or you could use a lateral join:

    select p.*, pv.title
    from pages p
    cross join lateral (
        select pv.*
        from page_versions pv
        where pv.page_id = p.page_id
        order by pv.timestamp desc limit 1
    ) pv
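
    As a further sketch, the same top-1-per-group logic can be written with the row_number() window function, which also works in databases that lack distinct on. The schema below is hypothetical: the table and column names are assumptions carried over from the queries above, and version_id and the index name are invented for illustration. The index on (page_id, timestamp desc) is a suggestion that may help the lateral join in particular; whether the planner uses it depends on your data.

    ```sql
    -- Hypothetical schema: names are assumptions taken from the queries above.
    create table pages (
        page_id bigint primary key
    );

    create table page_versions (
        version_id bigint primary key,          -- assumed surrogate key
        page_id    bigint not null references pages (page_id),
        title      text not null,
        timestamp  timestamptz not null
    );

    -- Lets a per-page lookup find the newest version without sorting all versions.
    create index page_versions_page_id_timestamp_idx
        on page_versions (page_id, timestamp desc);

    -- Number each page's versions from newest to oldest, then keep only number 1.
    select p.*, pv.title
    from pages p
    inner join (
        select v.*,
               row_number() over (
                   partition by v.page_id
                   order by v.timestamp desc
               ) as rn
        from page_versions v
    ) pv on pv.page_id = p.page_id
    where pv.rn = 1;
    ```

    Like the two queries above, this omits pages that have no versions at all. If two versions of a page share the same timestamp, which one is returned is arbitrary unless you add a tie-breaker to the order by (for example the assumed version_id).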