sql sql-server performance t-sql sql-server-2022

Find corresponding records over multiple columns for each key value given a sort order

I'm looking for a (T-)SQL solution to the following question. I have a table with columns

Key (int, not null),
Column1 (varchar(30), null), Column2, Column3 (up to Column9 but for brevity I'll stop at 3),
SortOrder (int, not null)

Values in the Key column may be present in multiple rows. As a result set I need a row for each Key with values in the Column1-Column3 taken from potentially different rows in the initial data, where ColumnX value is not null and ordered by SortOrder. Hopefully it's easier to understand this from the example:

Key	Column1	Column2	Column3	SortOrder
Key1	C1	NULL	NULL	1
Key1	C10	C11	NULL	2
Key1	C20	C21	C22	3
Key2	NULL	C30	NULL	1
Key2	C40	C41	NULL	2

The result set in this case would be

Key	Column1	Column2	Column3
Key1	C1	C11	C22
Key2	C40	C30	NULL

There are 100K-200K rows in the table so I'm looking for a good performing solution. Ideally one which reads the table only once but I'm not sure it's even possible (I'll gladly take analytical functions if they help performance or readability in this case). Suggestions for indexing are also welcome though I assume it will be by (Key, SortOrder) anyway.

I have a query which returns the desired result but I'm not quite happy with readability and especially performance:

WITH Keys AS (SELECT Key FROM MyTable GROUP BY Key)
SELECT
Key,
(SELECT TOP(1) A.Column1 FROM MyTable A WHERE A.Column1 IS NOT NULL AND A.Key=Keys.Key ORDER BY SortOrder) AS Column1,
(SELECT TOP(1) A.Column2 FROM MyTable A WHERE A.Column2 IS NOT NULL AND A.Key=Keys.Key ORDER BY SortOrder) AS Column2,
(SELECT TOP(1) A.Column3 FROM MyTable A WHERE A.Column3 IS NOT NULL AND A.Key=Keys.Key ORDER BY SortOrder) AS Column3
FROM Keys

Solution

You can use the new IGNORE NULLS option when doing FIRST_VALUE to get that you're looking for.

More old-school alternative is to use first_value(column1) OVER (partition BY [key] ORDER BY case when column1 is null then 1 else 0 end, sortorder) which manually puts the NULLs last.

Finally, a trick i like is to use aggregate together with padding to generate a combined sort + value column, after which the sort part is removed to leave the Value. This method has the upside that it doesn't require any window functions and is just a simple aggregate.

SELECT  [key]
,   max(c1) AS c1
,   max(c2) AS c2
,   max(c3) AS c3
,   MAX(c1_old) AS c1_old
,   MAX(c2_old) AS c2_old
,   MAX(c3_old) AS c3_old
,   STUFF(MIN(RIGHT(REPLICATE('0', 10) + CAST(SortOrder AS VARCHAR(10)), 10) + column1), 1, 10, '') AS c1_agg
,   STUFF(MIN(RIGHT(REPLICATE('0', 10) + CAST(SortOrder AS VARCHAR(10)), 10) + column2), 1, 10, '') AS c2_agg
,   STUFF(MIN(RIGHT(REPLICATE('0', 10) + CAST(SortOrder AS VARCHAR(10)), 10) + column3), 1, 10, '') AS c3_agg
FROM    (
    SELECT  *
    ,   first_value(column1) ignore nulls OVER(partition BY [key] ORDER BY sortorder rows between unbounded preceding and current row) c1
    ,   first_value(column2) ignore nulls OVER(partition BY [key] ORDER BY sortorder rows between unbounded preceding and current row) c2
    ,   first_value(column3) ignore nulls OVER(partition BY [key] ORDER BY sortorder rows between unbounded preceding and current row) c3
    
    ,   FIRST_VALUE(column1) OVER(partition BY [key] ORDER BY case when column1 is null then 1 else 0 end, sortorder rows between unbounded preceding and current row) AS c1_old
    ,   FIRST_VALUE(column2) OVER(partition BY [key] ORDER BY case when column2 is null then 1 else 0 end, sortorder rows between unbounded preceding and current row) AS c2_old
    ,   FIRST_VALUE(column3) OVER(partition BY [key] ORDER BY case when column3 is null then 1 else 0 end, sortorder rows between unbounded preceding and current row) AS c3_old
    
    FROM
    (
        VALUES  (N'Key1', N'C1', NULL, NULL, 1)
        ,   (N'Key1', N'C10', N'C11', NULL, 2)
        ,   (N'Key1', N'C20', N'C21', N'C22', 3)
        ,   (N'Key2', NULL, N'C30', NULL, 1)
        ,   (N'Key2', N'C40', N'C41', NULL, 2)
    ) t ([Key],Column1,Column2,Column3,SortOrder)
    ) x
GROUP BY [key]

Output:

key	c1	c2	c3	c1_old	c2_old	c3_old	c1_agg	c2_agg	c3_agg
Key1	C1	C11	C22	C1	C11	C22	C1	C11	C22
Key2	C40	C30	NULL	C40	C30	NULL	C40	C30	NULL