Search code examples
sqlsumgreatest-n-per-groupprestoamazon-athena

How to get top N rows with some conditions


I have a query something like this:

SELECT product_id,
         site,
         category_id,
         session_time,
         sum(cast(coalesce("#clicks",
         0) AS bigint)) AS clicks
FROM df
WHERE site IN ('com', 'co')
        AND session_time = DATE('2020-02-27')
GROUP BY  product_id, site, session_time, category_id
ORDER BY clicks desc
LIMIT 10

But now, I want to see the top 10 product_id for each site and category_id based on the clicks. When I write the LIMIT function, it only shows the top 10 products but it does not group it by category_id and shop_id.

How can I do this?


Solution

  • Use window functions. You can RANK() records by descending clicks within site/category partitions in a subquery, and then filter in the outer query:

    SELECT *
    FROM (
        SELECT 
            product_id,
            site,
            category_id,
            session_time,
            SUM("#clicks") clicks,
            RANK() OVER(PARTITION BY site, category_id ORDER BY sum("#clicks") DESC) rn
        FROM df
        WHERE 
            site IN ('com', 'co')
            AND session_time = DATE('2020-02-27')
        GROUP BY  product_id, site, session_time, category_id
    ) t
    WHERE rn <= 10
    ORDER BY site, category, clicks desc
    

    I am unclear on why you need the coalesce()/cast() logic in the sum() (just like other aggregate functions, sum() ignore null values, and it seems like #clicks is a number already), so I removed it - you can add it back if you do need it, for some reason that I could not think of.