Search code examples
sqlsql-serversql-server-2008-r2sql-order-bygreatest-n-per-group

SELECT 1 percent of distinct values


SQL Server 2008 R2

I have a table called Actions, this would be a snippet of what it looks like

ActionID | ActionType | ActionUserID | ActionDateTime
---------+------------+--------------+---------------------
555363     Open         9843           2020-09-15 09:27:55
555364     Process      2563           2020-09-15 09:31:22
555365     Close        8522           2020-09-15 09:37:48
555366     Detour       9843           2020-09-15 09:42:42
555367     Process      9843           2020-09-15 09:51:50
555368     Close        8522           2020-09-15 09:55:45
555369     Open         1685           2020-09-15 09:57:12
555370     Detour       2563           2020-09-15 10:03:23
555371     Detour       9843           2020-09-15 10:04:33
555372     Close        8522           2020-09-15 10:07:44

The table has hundreds of thousands of rows. What I want to do is review 1% of all actions performed by each user for a specific month.

I know I can get 1% of everything by doing:

SELECT TOP 1 PERCENT * 
FROM Actions 
WHERE ActionDateTime BETWEEN '09/01/2020' AND '09/30/2020' 
ORDER BY NEWID()

I know I can get 1% of a specific user by doing:

SELECT TOP 1 PERCENT * 
FROM Actions 
WHERE ActionUserID = 9843 
  AND ActionDateTime BETWEEN '09/01/2020' AND '09/30/2020' 
ORDER BY NEWID()

But what I really want to get is 1% of each user. I know I could get a list of the users who performed actions during the month by doing:

SELECT DISTINCT(ActionUserID) 
WHERE ActionDateTime BETWEEN '09/01/2020' AND '09/30/2020'

However I am not sure how to combine these two queries.


Solution

  • But what I really want to get is 1% of each user.

    I would recommend window function percent_rank():

    select *
    from (
        select a.*, percent_rank() over(partition by actionuserid order by newid()) prn
        from actions a
        where actiondatetime >= '20200901' and actiondatetime < '20201001'
    ) a
    where prn < 0.01
    

    If your version of SQL Server is so old that it does not support percent_rank(), then we can emulate it with rank() and count():

    select *
    from (
        select a.*, 
            rank() over(partition by actionuserid order by newid()) as rn,
            count(*) over(partition by actionuserid) as cnt
        from actions a
        where actiondatetime >= '20200901' and actiondatetime < '20201001'
    ) a
    where 100.0 * rn / cnt  < 1 or (rn = 1 and cnt < 100)