Search code examples
sqlsql-serverpartition-by

Performing complex data partitioning when using PARTITION BY in SQL Server


I am not a SQL expert, not even close. In the below example am using the PARTITION BY clause in a PERCENT_RANK() operation (Microsoft SQL Server hosted in Azure) to group my data for ranking which works as expected:

DECLARE @Dinky TABLE
(
    [Id] INT PRIMARY KEY NOT NULL,
    [Name] nvarchar(32) NOT NULL,
    [Score] float NOT NULL,
    [Type] nvarchar(4) NOT NULL,
    [Pool] int NOT NULL
);

INSERT INTO @Dinky ([Id], [Name], [Score], [Type], [Pool]) VALUES (1, 'Bob',   1.4, 'A', 1);
INSERT INTO @Dinky ([Id], [Name], [Score], [Type], [Pool]) VALUES (2, 'Mary',  2.6, 'A', 2);
INSERT INTO @Dinky ([Id], [Name], [Score], [Type], [Pool]) VALUES (3, 'Kim',   7.8, 'A', 3);
INSERT INTO @Dinky ([Id], [Name], [Score], [Type], [Pool]) VALUES (4, 'Chris', 3.3, 'A', 1);
INSERT INTO @Dinky ([Id], [Name], [Score], [Type], [Pool]) VALUES (5, 'Linda', 4.5, 'A', 1);
INSERT INTO @Dinky ([Id], [Name], [Score], [Type], [Pool]) VALUES (6, 'Frank', 2.1, 'A', 2);
INSERT INTO @Dinky ([Id], [Name], [Score], [Type], [Pool]) VALUES (7, 'Julie', 1.0, 'A', 3);
INSERT INTO @Dinky ([Id], [Name], [Score], [Type], [Pool]) VALUES (8, 'Greg',  6.7, 'A', 3);
INSERT INTO @Dinky ([Id], [Name], [Score], [Type], [Pool]) VALUES (9, 'Tom',   8.3, 'A', 2);

SELECT [Id]
      ,[Name]
      ,[Score]
      ,[Type]
      ,[Pool]
      ,PERCENT_RANK() OVER (PARTITION BY [Type], [Pool] ORDER BY [Score] DESC) [Rank]

FROM @Dinky

This will produce the expected output with my rows being partitioned first by their [Type] (which I realize are all the same in this example) and then by their respective [Pool]:

Id  Name   Score  Type  Pool  Rank
5   Linda  4.5     A     1    0
4   Chris  3.3     A     1    0.5
1   Bob    1.4     A     1    1
9   Tom    8.3     A     2    0
2   Mary   2.6     A     2    0.5
6   Frank  2.1     A     2    1
3   Kim    7.8     A     3    0
8   Greg   6.7     A     3    0.5
7   Julie  1       A     3    1

So far, so good - but of course it can't be that simple. In my particular business case, the rows in Pool 1 should be ranked along side all the rows in Pools 2 and 3 (think of it like a super-pool of everyone); further, the rows in Pool 2 should be ranked along with the rows in Pool 3 (a more exclusive pool); and of course the rows in Pool 3 are ranked only among themselves (the elite). Putting aside the question of 'how to express this' for a moment, the other part of the problem is that of duplicate rows. I only want to retain the rank result for the highest pool for each row. So in this example, Greg would be used for ranking in the 1-2-3 pool, and used for ranking in the 2-3 pool, and used for ranking in the 3 pool - but I only want his result for the 3 pool in my output.

After some fiddling around I was eventually able to get my desired result, but (I think) the result is far from elegant. I can't help but feel there is a better way to express this in SQL that I'm simply not fluent enough to produce. Here is what I came up with:

WITH poolOne AS
(
    SELECT [Id]
          ,[Name]
          ,[Score]
          ,[Type]
          ,[Pool]
          ,PERCENT_RANK() OVER (PARTITION BY [Type] ORDER BY [Score] DESC) [Rank]
    FROM @Dinky
    WHERE [Pool] >= 1
), poolTwo AS 
(
    SELECT [Id]
          ,[Name]
          ,[Score]
          ,[Type]
          ,[Pool]
          ,PERCENT_RANK() OVER (PARTITION BY [Type] ORDER BY [Score] DESC) [Rank]
    FROM @Dinky
    WHERE [Pool] >= 2
), poolThree AS 
(
    SELECT [Id]
          ,[Name]
          ,[Score]
          ,[Type]
          ,[Pool]
          ,PERCENT_RANK() OVER (PARTITION BY [Type] ORDER BY [Score] DESC) [Rank]
    FROM @Dinky
    WHERE [Pool] >= 3
)
SELECT * 
FROM poolThree
UNION 
SELECT * 
FROM poolTwo
WHERE NOT EXISTS (SELECT 1 FROM poolThree 
                  WHERE poolTwo.Id = poolThree.Id) 
UNION 
SELECT *
FROM poolOne
WHERE NOT EXISTS (SELECT 1 FROM poolTwo 
                  WHERE poolOne.Id = poolTwo.Id)

I get this output:

Id  Name   Score  Type  Pool  Rank
5   Linda   4.5    A     1    0.375
4   Chris   3.3    A     1    0.5
1   Bob     1.4    A     1    0.875
9   Tom     8.3    A     2    0
2   Mary    2.6    A     2    0.6
6   Frank   2.1    A     2    0.8
3   Kim     7.8    A     3    0
8   Greg    6.7    A     3    0.5
7   Julie   1      A     3    1

Notice Kim, Greg and Julie are the same as they were in my original query because they are exclusively ranked within their pool (Pool 3). But the ranking for Tom, Mary and Frank have slightly changed because they were ranked with their pool (Pool 2) and the more exclusive pool (Pool 3). Finally, the rankings for Linda, Chris and Bob were ranked against everyone (Pools 1, 2 and 3).

I realize in my example above I could rewrite the WITH poolOne to remove the WHERE clause, and rewrite the WITH pool3 to remove the WHERE and change the PARTITION BY to it's original form. But for what it is (all things being equal), I think the logic is more clearly expressed the way it is (and is certainly better for an example asking for help).

What could I do to rewrite this in a more elegant, shorter, better, more performant, (pick your adjective) way? Can someone please dazzle me with their sql-foo?


Solution

  • I don't know if there's a way to avoid using PERCENT_RANK() multiple times, but one way to do achieve this result without the unions would be with a CASE expression.

    SELECT [Id]
      ,[Name]
      ,[Score]
      ,[Type]
      ,[Pool]
      ,[Rank] = 
       CASE [Pool]
           WHEN 3 THEN PERCENT_RANK() OVER (PARTITION BY [Type], [Pool] ORDER BY [Score] DESC)
           WHEN 2 THEN PERCENT_RANK() OVER (PARTITION BY [Type], CASE WHEN [Pool] >= 2 THEN 1 END ORDER BY [Score] DESC)
           WHEN 1 THEN PERCENT_RANK() OVER (PARTITION BY [Type] ORDER BY [Score] DESC)
       END
    FROM @Dinky
    ORDER BY [Type], [Pool], [Score] DESC;