Search code examples
sqlamazon-athenaprestopercentiletrino

How to select top 75% of rows using SQL


I have a table with 3 columns: user (string), home (string), num_events (integer), like this:

user1,A,4
user2,B,5
user3,A,2
user4,C,12

A certain home can "contain" many users, but each user belongs to only 1 home.

I need to filter out all the first quartile of users with the lower num_events for each home. In order to do it, my plan is to list all users for each home, then order by num_events descending, and keep only the first 75%, then count the users for each home.

My attempts are similar to this:

WITH t AS (
SELECT home, user, num_events
FROM table 
GROUP BY home_cusec, user, num_events
ORDER BY home_cusec, num_events DESC
)

SELECT home, COUNT(distinct user) FROM t
WHERE num_events > APPROX_PERCENTILE(num_events, 0.25)
GROUP BY home
ORDER BY home

But none of my attemts ave been successful. Please note that I am using AWS Athena, and the function TOP is not registered, so I cannot use

SELECT TOP 75 PERCENT * FROM t 

I have tried with

LIMIT (SELECT ROUND( COUNT(*)*0.75 FROM t)

But also with this the query is not valid in AWS Athena...

Any ideas?


Solution

  • We can use ROW_NUMBER() and COUNT() here:

    WITH cte AS (
        SELECT *, ROW_NUMBER() OVER (ORDER BY num_events DESC) rn,
                  COUNT(*) OVER () AS cnt
        FROM yourTable
    )
    
    SELECT user, home, num_events
    FROM cte
    WHERE 1.0*rn / cnt <= 0.75;