Search code examples
postgresqlsqlalchemysubquery

joining with a DISTINCT ON on an ordered subquery in sqlalchemy


Here is (an extremely simplified version of) my problem.

I'm using Postgresql as the backend and trying to build a sqlalchemy query from another query.

Table setup

Here are the tables with some random data for the example. You can assume that each table was declared in sqlalchemy declaratively, with the name of the mappers being respectively Item and ItemVersion. At the end of the question you can find a link where I put the code for everything in this question, including the table definitions.

Some items.

item
+----+
| id |
+----+
|  1 |
|  2 |
|  3 |
+----+

A table containing versions of each item. Each has at least one.

item_version
+----+---------+---------+-----------+
| id | item_id | version | text      |
+----+---------+---------+-----------+
|  1 |       1 |       0 | item_1_v0 |
|  2 |       1 |       1 | item_1_v1 |
|  3 |       2 |       0 | item_2_v0 |
|  4 |       3 |       0 | item_3_v0 |
+----+---------+---------+-----------+

The query

Now, for a given sqlalchemy query over Item, I want a function that returns another query, but this time over (Item, ItemVersion), where the Items are the same as in the original query (and in the same order!), and where the ItemVersion are the corresponding latest versions for each Item.

Here is an example in SQL, which is pretty straightforward:

First a random query over the item table

SELECT item.id as item_id
FROM item
WHERE item.id != 2
ORDER BY item.id DESC

which corresponds to

+---------+
| item_id |
+---------+
|       3 |
|       1 |
+---------+

Then from that query, if I want to join the right versions, I can do

SELECT sq2.item_id AS item_id,
       sq2.item_version_id AS item_version_id,
       sq2.item_version_text AS item_version_text
FROM (
    SELECT DISTINCT ON (sq.item_id)
           sq.item_id AS item_id,
           iv.id AS item_version_id,
           iv.text AS item_version_text
    FROM (
        SELECT item.id AS item_id
        FROM item
        WHERE id != 2
        ORDER BY id DESC) AS sq
    JOIN item_version AS iv
      ON iv.item_id = sq.item_id
    ORDER BY sq.item_id, iv.version DESC) AS sq2
ORDER BY sq2.item_id DESC

Note that it has to be wrapped in a subquery a second time because the DISTINCT ON discards the ordering.

Now the challenge is to write a function that does that in sqlalchemy. Here is what I have so far.

First the initial sqlalchemy query over the items:

session.query(Item).filter(Item.id != 2).order_by(desc(Item.id))

Then I'm able to build my second query but without the original ordering. In other words I don't know how to do the second subquery wrapping that I did in SQL to get back the ordering that was discarded by the DISTINCT ON.

def join_version(session, query):
    sq = aliased(Item, query.subquery('sq'))

    sq2 = session.query(sq, ItemVersion) \
        .distinct(sq.id) \
        .join(ItemVersion) \
        .order_by(sq.id, desc(ItemVersion.version))
    return sq2

I think this SO question could be part of the answer but I'm not quite sure how.

The code to run everything in this question (database creation, population and a failing unit test with what I have so far) can be found here. Normally if you can fix the join_version function, it should make the test pass!


Solution

  • Ok so I found a way. It's a bit of a hack but still only queries the database twice so I guess I will survive! Basically I'm querying the database for the Items first, and then I do another query for the ItemVersions, filtering on item_id, and then reordering with a trick I found here (this is also relevant).

    Here is the code:

    def join_version(session, query):                                  
        items = query.all()                                            
        item_ids = [i.id for i in items]                               
        items_v_sq = session.query(ItemVersion) \                      
            .distinct(ItemVersion.item_id) \                           
            .filter(ItemVersion.item_id.in_(item_ids)) \               
            .order_by(ItemVersion.item_id, desc(ItemVersion.version)) \
            .subquery('sq')                                            
        sq = aliased(ItemVersion, items_v_sq)                          
        items_v = session.query(sq) \                                  
            .order_by('idx(array{}, sq.item_id)'.format(item_ids))     
                                                                       
        return zip(items, items_v)