Search code examples
pythonpostgresqlsqlalchemygreatest-n-per-group

How to get top n results per group from a pool of ids in SQLAlchemy?


I have two tables. The 1st represent one article per row and the 2nd represent one news_cluster per row (with an array column for all articles_id per cluster).

I need to get the 19 best news_cluster and then get the 4 best articles for each news_cluster (order by l_score column).

Currently I'm fetching all articles_id from all 19 news_cluster with SQLAlchemy and then I sort them by news_cluster and l_score with python :

all_news_obj = News.query.order_by(score).limit(19)
all_ids = sum([a_news.articles_id for a_news in all_news_obj], [])
all_articles = Articles.query.filter(Articles.id.in_(all_ids)).all()
articles_sorted = 4_best_l_score_sorted_by_cluster(all_articles)

I was wondering if there was a faster way to fetch and sort these articles and I saw that you can group by cluster (news_id) and get the top 4 l_score for each group with this psql query:

SELECT ranked.* FROM 
(SELECT articles.*, rank() OVER (PARTITION BY news_id ORDER BY l_score DESC)
    FROM articles
    WHERE id = ANY(ARRAY[209146, 209140, 209154...])
) ranked
WHERE rank <=4

So it might be faster if psql handle the sorting in the first place but I have no clue how to implement this query in SQLAlchemy.


Solution

  • My question is mostly answered here : Convert rank and partition query to SqlAlchemy

    I just needed to adapt it to my Articles model and add a filter to match the ids in all_ids_array :

    subquery = db.session.query(
        Articles,
        func.rank().over(
            order_by=Articles.l_score.desc(),
            partition_by=Articles.news_id
        ).label('rank')
    ).filter(Articles.id.in_(all_ids_array)).subquery()
    
    articles_sorted = db.session.query(subquery).filter(subquery.c.rank <= 4)
    

    I'm a happy man : It's 0.1 second faster than fetching all articles and sorting them afterward with python.