Search code examples
hiveapache-spark-sqlhiveqltop-n

Top N sorted rows with GROUP BY?


I have the following transaction table:

customer_id, category, product_id, score

I'm grouping by customer_id and category to create a list of product_id-score map pairs:

SELECT
    s.customer_id,    
    s.category,
    collect_list(s.pair) 
FROM
    (
        SELECT
            customer_id,
            category,
            map(product_id, score) AS pair
        FROM
            transaction
        WHERE
            score > {score_threshold}
    ) s 
GROUP BY
    s.customer_id,
    s.category

Now I want to take this a step further. For each group, I'm looking to retain just the top n pairs, sorted by score (descending order).I tried the OVER (PARTITION BY...ORDER BY) by I'm running into problems.

Note: the transaction table is partitioned by category

Thanks


Solution

  • Try this:

    SELECT
            s.customer_id,    
            s.category,
            collect_list(s.pair) 
        FROM
            (
                SELECT
                    ROW_NUMBER() OVER (PARTITION BY customer_id, category ORDER BY score desc) as RowId 
                    customer_id,
                    category,
                    map(product_id, score) AS pair
                FROM
                    transaction
                WHERE
                    score > {score_threshold}
            ) s 
            where s.RowId < n
        
        GROUP BY
            s.customer_id,
            s.category