I have two tables. The 1st represent one article
per row and the 2nd represent one news_cluster
per row (with an array column for all articles_id
per cluster).
I need to get the 19 best news_cluster
and then get the 4 best articles for each news_cluster
(order by l_score column
).
Currently I'm fetching all articles_id
from all 19 news_cluster
with SQLAlchemy and then I sort them by news_cluster
and l_score
with python :
all_news_obj = News.query.order_by(score).limit(19)
all_ids = sum([a_news.articles_id for a_news in all_news_obj], [])
all_articles = Articles.query.filter(Articles.id.in_(all_ids)).all()
articles_sorted = 4_best_l_score_sorted_by_cluster(all_articles)
I was wondering if there was a faster way to fetch and sort these articles and I saw that you can group by cluster (news_id
) and get the top 4 l_score
for each group with this psql query:
SELECT ranked.* FROM
(SELECT articles.*, rank() OVER (PARTITION BY news_id ORDER BY l_score DESC)
FROM articles
WHERE id = ANY(ARRAY[209146, 209140, 209154...])
) ranked
WHERE rank <=4
So it might be faster if psql handle the sorting in the first place but I have no clue how to implement this query in SQLAlchemy.
My question is mostly answered here : Convert rank and partition query to SqlAlchemy
I just needed to adapt it to my Articles
model and add a filter to match the ids in all_ids_array
:
subquery = db.session.query(
Articles,
func.rank().over(
order_by=Articles.l_score.desc(),
partition_by=Articles.news_id
).label('rank')
).filter(Articles.id.in_(all_ids_array)).subquery()
articles_sorted = db.session.query(subquery).filter(subquery.c.rank <= 4)
I'm a happy man : It's 0.1 second faster than fetching all articles and sorting them afterward with python.