Search code examples
sqliterationaggregate-functionsranking-functions

Conditional aggregation in SQL Server identifying the latest non NULL entry


Starting Point:

I have a large table with many columns, of which the first is an identifier (ID) and the second a reversed rank over this ID (Rank). Each ID can have an arbitrary number of rows displaying either a NULL or a float value in each column with up to 50+ columns. There is no limiting range for the float values. Below is a small example table already grouped by the ID column.

ID Rank Column A Column B Column C Column D Column E
1 1 NULL 0.7 2.7 NULL 0.6
1 2 0.9 1.6 0.6 NULL NULL
1 3 NULL 0.4 2.3 NULL 0.3
1 4 0.6 NULL 1.4 NULL NULL

Goal:

I want to aggregate this table grouped by the ID, taking the non-NULL values with the highest rank. If only NULL values are found, a NULL value should be used. So the output for the table above should look like this:

ID Rank Column A Column B Column C Column D Column E
1 4 0.6 0.4 1.4 NULL 0.3

Standard aggregation functions such as MAX or MIN did not work for me because, exemplarily, in column B they would take 2.7 or 0.6, respectively, and not the correct value (1.4).

My current solution is outside of SQL and iterates for each ID over all columns returning the numeric value with the highest rank or NULL if there is no numeric value for a certain ID.

Any ideas on how to solve this in SQL?


Solution

  • This can be retrieved using window functions. Since you did not specify a database I used Postgresql. Here is a running example: http://sqlfiddle.com/#!17/68520/1.

    select 
    distinct
    "ID", 
    max("Rank") over (partition by "ID") "Rank",
    first_value("Column_A") over (partition by "ID" order by "Column_A" is null, "Rank" desc) "Column_A",
    first_value("Column_B") over (partition by "ID" order by "Column_B" is null, "Rank" desc) "Column_B",
    first_value("Column_C") over (partition by "ID" order by "Column_C" is null, "Rank" desc) "Column_C",
    first_value("Column_D") over (partition by "ID" order by "Column_D" is null, "Rank" desc) "Column_D",
    first_value("Column_E") over (partition by "ID" order by "Column_E" is null, "Rank" desc) "Column_E"
    from Table1 
    

    To create the data I used this DDL / SQL:

    CREATE TABLE Table1
    ("ID" int, "Rank" int, "Column_A" float, "Column_B" float, "Column_C" float, "Column_D" float, "Column_E" float);
    
    INSERT INTO Table1
    ("ID", "Rank", "Column_A", "Column_B", "Column_C", "Column_D", "Column_E")
    VALUES
    (1, 1, NULL, 0.7, 2.7, NULL, 0.6),
    (1, 2, 0.9, 1.6, 0.6, NULL, NULL),
    (1, 3, NULL, 0.4, 2.3, NULL, 0.3),
    (1, 4, 0.6, NULL, 1.4, NULL, NULL);
    

    The tricky part here is the use of a window function first_value and the needed ordering of values. All values for one ID value are sorted descending by Rank. Using "Column_A" is null for ordering moves all null values to the end (at least for Postgresql).

    So this SQL results in

    ID Rank Column_A Column_B Column_C Column_D Column_E
    1 4 0.6 0.4 1.4 (null) 0.3

    Update for SQLServer

    Using SQLServer this "Column_A" is null construct is not usable within order by. It has to be replaced by case when "Column_A" is null then 1 else 0 end. So the SQLServer SQL looks like:

    select 
    distinct
    "ID", 
    max("Rank") over (partition by "ID") "Rank",
    first_value("Column_A") over (partition by "ID" order by case when "Column_A" is null then 1 else 0 end, "Rank" desc) "Column_A",
    first_value("Column_B") over (partition by "ID" order by case when "Column_B" is null then 1 else 0 end, "Rank" desc) "Column_B",
    first_value("Column_C") over (partition by "ID" order by case when "Column_C" is null then 1 else 0 end, "Rank" desc) "Column_C",
    first_value("Column_D") over (partition by "ID" order by case when "Column_D" is null then 1 else 0 end, "Rank" desc) "Column_D",
    first_value("Column_E") over (partition by "ID" order by case when "Column_E" is null then 1 else 0 end, "Rank" desc) "Column_E"
    from Table1