Search code examples
sqlsql-serverperformancet-sqlsql-server-2022

Find corresponding records over multiple columns for each key value given a sort order


I'm looking for a (T-)SQL solution to the following question. I have a table with columns

  • Key (int, not null),
  • Column1 (varchar(30), null), Column2, Column3 (up to Column9 but for brevity I'll stop at 3),
  • SortOrder (int, not null)

Values in the Key column may be present in multiple rows. As a result set I need a row for each Key with values in the Column1-Column3 taken from potentially different rows in the initial data, where ColumnX value is not null and ordered by SortOrder. Hopefully it's easier to understand this from the example:

Key Column1 Column2 Column3 SortOrder
Key1 C1 NULL NULL 1
Key1 C10 C11 NULL 2
Key1 C20 C21 C22 3
Key2 NULL C30 NULL 1
Key2 C40 C41 NULL 2

The result set in this case would be

Key Column1 Column2 Column3
Key1 C1 C11 C22
Key2 C40 C30 NULL

There are 100K-200K rows in the table so I'm looking for a good performing solution. Ideally one which reads the table only once but I'm not sure it's even possible (I'll gladly take analytical functions if they help performance or readability in this case). Suggestions for indexing are also welcome though I assume it will be by (Key, SortOrder) anyway.

I have a query which returns the desired result but I'm not quite happy with readability and especially performance:

WITH Keys AS (SELECT Key FROM MyTable GROUP BY Key)
SELECT
Key,
(SELECT TOP(1) A.Column1 FROM MyTable A WHERE A.Column1 IS NOT NULL AND A.Key=Keys.Key ORDER BY SortOrder) AS Column1,
(SELECT TOP(1) A.Column2 FROM MyTable A WHERE A.Column2 IS NOT NULL AND A.Key=Keys.Key ORDER BY SortOrder) AS Column2,
(SELECT TOP(1) A.Column3 FROM MyTable A WHERE A.Column3 IS NOT NULL AND A.Key=Keys.Key ORDER BY SortOrder) AS Column3
FROM Keys

Solution

  • You can use the new IGNORE NULLS option when doing FIRST_VALUE to get that you're looking for.

    More old-school alternative is to use first_value(column1) OVER (partition BY [key] ORDER BY case when column1 is null then 1 else 0 end, sortorder) which manually puts the NULLs last.

    Finally, a trick i like is to use aggregate together with padding to generate a combined sort + value column, after which the sort part is removed to leave the Value. This method has the upside that it doesn't require any window functions and is just a simple aggregate.

    SELECT  [key]
    ,   max(c1) AS c1
    ,   max(c2) AS c2
    ,   max(c3) AS c3
    ,   MAX(c1_old) AS c1_old
    ,   MAX(c2_old) AS c2_old
    ,   MAX(c3_old) AS c3_old
    ,   STUFF(MIN(RIGHT(REPLICATE('0', 10) + CAST(SortOrder AS VARCHAR(10)), 10) + column1), 1, 10, '') AS c1_agg
    ,   STUFF(MIN(RIGHT(REPLICATE('0', 10) + CAST(SortOrder AS VARCHAR(10)), 10) + column2), 1, 10, '') AS c2_agg
    ,   STUFF(MIN(RIGHT(REPLICATE('0', 10) + CAST(SortOrder AS VARCHAR(10)), 10) + column3), 1, 10, '') AS c3_agg
    FROM    (
        SELECT  *
        ,   first_value(column1) ignore nulls OVER(partition BY [key] ORDER BY sortorder rows between unbounded preceding and current row) c1
        ,   first_value(column2) ignore nulls OVER(partition BY [key] ORDER BY sortorder rows between unbounded preceding and current row) c2
        ,   first_value(column3) ignore nulls OVER(partition BY [key] ORDER BY sortorder rows between unbounded preceding and current row) c3
        
        ,   FIRST_VALUE(column1) OVER(partition BY [key] ORDER BY case when column1 is null then 1 else 0 end, sortorder rows between unbounded preceding and current row) AS c1_old
        ,   FIRST_VALUE(column2) OVER(partition BY [key] ORDER BY case when column2 is null then 1 else 0 end, sortorder rows between unbounded preceding and current row) AS c2_old
        ,   FIRST_VALUE(column3) OVER(partition BY [key] ORDER BY case when column3 is null then 1 else 0 end, sortorder rows between unbounded preceding and current row) AS c3_old
        
        FROM
        (
            VALUES  (N'Key1', N'C1', NULL, NULL, 1)
            ,   (N'Key1', N'C10', N'C11', NULL, 2)
            ,   (N'Key1', N'C20', N'C21', N'C22', 3)
            ,   (N'Key2', NULL, N'C30', NULL, 1)
            ,   (N'Key2', N'C40', N'C41', NULL, 2)
        ) t ([Key],Column1,Column2,Column3,SortOrder)
        ) x
    GROUP BY [key]
    

    Output:

    key c1 c2 c3 c1_old c2_old c3_old c1_agg c2_agg c3_agg
    Key1 C1 C11 C22 C1 C11 C22 C1 C11 C22
    Key2 C40 C30 NULL C40 C30 NULL C40 C30 NULL