I have a table with 3 columns: user
(string), home
(string), num_events
(integer), like this:
user1,A,4
user2,B,5
user3,A,2
user4,C,12
A certain home can "contain" many users, but each user belongs to only 1 home.
I need to filter out all the first quartile of users with the lower num_events for each home. In order to do it, my plan is to list all users for each home, then order by num_events descending, and keep only the first 75%, then count the users for each home.
My attempts are similar to this:
WITH t AS (
SELECT home, user, num_events
FROM table
GROUP BY home_cusec, user, num_events
ORDER BY home_cusec, num_events DESC
)
SELECT home, COUNT(distinct user) FROM t
WHERE num_events > APPROX_PERCENTILE(num_events, 0.25)
GROUP BY home
ORDER BY home
But none of my attemts ave been successful. Please note that I am using AWS Athena, and the function TOP is not registered, so I cannot use
SELECT TOP 75 PERCENT * FROM t
I have tried with
LIMIT (SELECT ROUND( COUNT(*)*0.75 FROM t)
But also with this the query is not valid in AWS Athena...
Any ideas?
We can use ROW_NUMBER()
and COUNT()
here:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY num_events DESC) rn,
COUNT(*) OVER () AS cnt
FROM yourTable
)
SELECT user, home, num_events
FROM cte
WHERE 1.0*rn / cnt <= 0.75;