Search code examples
sqlsql-servergaps-and-islands

Overall Order from a Reduced Dataset Against a Large Dataset


There is a scheme used in a process to create a delta of changes in a dataset from one day to the next. The query needs to get the overall order of a record within the entire result set of a view, while returning only a subset of records. This, I guess, was going fine until an ugly view pushed the passed the 30 second ado timeout of the application. I thought it would be easy to fix this up, however, it is proving dauntless. The tables that the views utilize have millions of rows, but the indexing is maintained.

Query1 - Return just the 2 keys and row_number against millions of records on a view with a heap of left joins that may be optimized out. (I am having to change this from a select * with a row_number because the * is the problem)

Query2 - Join the big view against a key list of 25 ID's and then join Query1 that has the overall data offset.

Query3 Remove duplicates from joining the two datasets.

DECLARE @KeysToSelect TABLE(Key1 INT)
INSERT INTO @KeysToSelect VALUES(139743),(139878),(139953)

DECLARE @BigView TABLE(Key1 INT, Key2 INT, Area DECIMAL(18,2), BuildingType NVARCHAR(25))
INSERT @BigView VALUES
(100, NULL, 0,''),
(101, NULL, 0,''),
(200, NULL, 0,''),
(201, NULL, 0,''),
(139743, NULL, 8475.00,'Industrial'),
(139743, NULL, 593.00,  'Office'),
(139744, NULL, 0,''),
(139745, NULL, 0,''),
(139746, NULL, 0,''),
(139747, NULL, 0,''),
(139878, NULL, 1268.00,'Office'),
(139878, NULL, 15534.00,'Warehouse'),
(139879, NULL, 0,''),
(139880, NULL, 0,''),
(139881, NULL, 0,''),
(139953, 6, 20000.00,'Warehouse'),
(139953, 14,20000.00,'Office'),
(149956, NULL, 0,''),
(149957, NULL, 0,''),
(149958, NULL, 0,'')

;WITH OveralOrderInData AS
(
    SELECT
        --!!!! Cant't SELECT * FROM @BigView Here because it hits a 30 second timeout limit.  This is what the solve is for. 
        --It would be easy just to return the data with a rank, however, in this case going into a skinny slice of the data to get overall count
        --and joining that against a limited subset of the larger viuew returns in under s second
        --Grabbing the 2 keys and ranking takes less than a second, lots of left joins and views calling views
        ROW_NUMBER() OVER (ORDER BY vw.Key1 ASC, vw.Key2 ASC) AS OrderInData,
        CASE WHEN Keys.Key1 IS NULL THEN NULL ELSE SUM(CASE WHEN Keys.Key1 IS NULL THEN NULL ELSE 1 END) OVER (ORDER BY vw.Key2 ASC,vw.Key1 ASC ROWS UNBOUNDED PRECEDING) END AS MatchedOrderInSearchKeys,
        CASE WHEN Keys.Key1 IS NULL THEN 0 ELSE 1 END AS IsMatched,
        vw.Key1,
        vw.Key2
    FROM
        @BigView vw
        LEFT OUTER JOIN @KeysToSelect Keys ON Keys.Key1 = vw.Key1
)
,Normalized AS
(
    --Second dive into the data, this time, we can filter records based on a very selected Key list and join back our overall order data
    --this can cause doubles and quadruples. Basically, a count will be associated with each duplicate Key1 and Key 2 value.
    --somehow turn these over and rank the duplicates as 1 and 2 and 1 and 2 as opposed to 1 and 1 and 1 and 1 :(
    SELECT
        O.OrderInData,
        O.MatchedOrderInSearchKeys,
        O.Key1,
        O.Key2,
        vw.Area , 
        vw.BuildingType,    
        DENSE_RANK() OVER(PARTITION BY O.Key1 ,O.Key2 ORDER BY OrderInData) AS DistributeOrder, 
        *
    FROM
        OveralOrderInData O
        INNER JOIN @BigView vw ON vw.Key1 = O.Key1 AND (vw.Key2 = O.Key2 OR O.Key2 IS NULL)
    WHERE
        O.IsMatched = 1
)
SELECT 
    * 
FROM 
    Normalized
WHERE   
    DistributeOrder = 1

The query above returns data in which the first OrderInData found for duplicates of Key1 and Key2 is realized, see below:

OrderInData Key1 Key2 Area BuildingType
5 139743 NULL 8475.00 Industrial
5 139743 NULL 593.00 Office
11 139878 NULL 1268.00 Office
11 139878 NULL 15534.00 Warehouse
16 139953 6 20000.00 Warehouse
17 139953 14 20000.00 Office

Desired Output

OrderInData Key1 Key2 Area BuildingType
5 139743 NULL 8475.00 Industrial
6 139743 NULL 593.00 Office
11 139878 NULL 1268.00 Office
12 139878 NULL 15534.00 Warehouse
16 139953 6 20000.00 Warehouse
17 139953 14 20000.00 Office

My next move would be to create virtualized tables in subqueries containing the distinct orders against items in keys and join again, however, that would take another two subqueries, and this was supposed to be a quick fix. Is there a silver bullet that I am just too daft to see at this point?


Solution

  • Without trying to reverse engineer anything before it, the final select can just have an expression sourced in row number:

    SELECT
        RealOrderInData = OrderInData - 1
          + ROW_NUMBER() OVER (PARTITION BY MatchedOrderInSearchKeys 
            ORDER BY BuildingType), -- should pick something more deterministic
        * 
    FROM 
        Normalized
    WHERE   
        DistributeOrder = 1;
    

    It's quite possible the rest of the query could be consolidated / simplified, but that's a tall order at night. :-)