Search code examples
sqloracle-databaseselectoracle12casset-management

Aggregate Overlapping Segments to Measure Effective Length


I have a road_events table:

create table road_events (
    event_id number(4,0),
    road_id number(4,0),
    year number(4,0),
    from_meas number(10,2),
    to_meas number(10,2),
    total_road_length number(10,2)
    );

insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (1,1,2020,25,50,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (2,1,2000,25,50,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (3,1,1980,0,25,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (4,1,1960,75,100,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (5,1,1940,1,100,100);

insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (6,2,2000,10,30,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (7,2,1975,30,60,100);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (8,2,1950,50,90,100);

insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (9,3,2050,40,90,100);

insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (10,4,2040,0,200,200);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (11,4,2013,0,199,200);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (12,4,2001,0,200,200);

insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (13,5,1985,50,70,300);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (14,5,1985,10,50,300);
insert into road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) values (15,5,1965,1,301,300);
commit;

select * from road_events;

  EVENT_ID    ROAD_ID       YEAR  FROM_MEAS    TO_MEAS TOTAL_ROAD_LENGTH
---------- ---------- ---------- ---------- ---------- -----------------
         1          1       2020         25         50               100
         2          1       2000         25         50               100
         3          1       1980          0         25               100
         4          1       1960         75        100               100
         5          1       1940          1        100               100

         6          2       2000         10         30               100
         7          2       1975         30         60               100
         8          2       1950         50         90               100

         9          3       2050         40         90               100

        10          4       2040          0        200               200
        11          4       2013          0        199               200
        12          4       2001          0        200               200

        13          5       1985         50         70               300
        14          5       1985         10         50               300
        15          5       1965          1        301               300

I want to select the events that represent the most recent work on each road.

This is a tricky operation, because the events often pertain to only a portion of the road. This means that I can't simply select the most recent event per road; I need to only select the most recent event mileage that doesn't overlap.


Possible logic (in order):

I'm reluctant to guess at how this problem could be solved, because it could end up hurting more than it helps (kind of like the XY Problem). On the other hand, it might provide insight into the nature of the problem, so here it goes:

  1. Select the most recent event for each road. We'll call the most recent event: event A.
  2. If event A is >= total_road_length, then that's all I need. The algorithm ends here.
  3. Else, get the next chronological event (event B) that does not have the same extents as event A.
  4. If the extents of event B overlap the extents of event A, then only get the portion(s) of event B that do not overlap.
  5. Repeat steps 3 and 4 until the total event length is = total_road_length. Or stop when there are no more events for that road.

Question:

I know it's a tall order, but what would it take to do this?

This is a classic linear referencing problem. It would be extremely helpful if I could do linear referencing operations as part of queries.

The result would be:

  EVENT_ID    ROAD_ID       YEAR  TOTAL_ROAD_LENGTH   EVENT_LENGTH
---------- ---------- ----------  -----------------   ------------
         1          1       2020                100             25
         3          1       1980                100             25
         4          1       1960                100             25
         5          1       1940                100             25

         6          2       2000                100             20
         7          2       1975                100             30
         8          2       1950                100             30

         9          3       2050                100             50

        10          4       2040                200            200

        13          5       1985                300             20
        14          5       1985                300             40
        15          5       1965                300            240

Related question: Select where number range does not overlap


Solution

  • My main DBMS is Teradata, but this will work as-is in Oracle, too.

    WITH all_meas AS
     ( -- get a distinct list of all from/to points
       SELECT road_id, from_meas AS meas
       FROM road_events
       UNION
       SELECT road_id, to_meas
       FROM road_events
     )
    -- select * from all_meas order by 1,2
     , all_ranges AS
     ( -- create from/to ranges
       SELECT road_id, meas AS from_meas 
         ,Lead(meas)
          Over (PARTITION BY road_id
                ORDER BY meas) AS to_meas
       FROM all_meas
      )
     -- SELECT * from all_ranges order by 1,2
    , all_event_ranges AS
     ( -- now match the ranges to the event ranges
       SELECT 
          ar.*
         ,re.event_id
         ,re.year
         ,re.total_road_length
         ,ar.to_meas - ar.from_meas AS event_length
         -- used to filter the latest event as multiple events might cover the same range 
         ,Row_Number()
          Over (PARTITION BY ar.road_id, ar.from_meas
                ORDER BY year DESC) AS rn
       FROM all_ranges ar
       JOIN road_events re
         ON ar.road_id = re.road_id
        AND ar.from_meas < re.to_meas
        AND ar.to_meas > re.from_meas
       WHERE ar.to_meas IS NOT NULL
     )
    SELECT event_id, road_id, year, total_road_length, Sum(event_length)
    FROM all_event_ranges
    WHERE rn = 1 -- latest year only
    GROUP BY event_id, road_id, year, total_road_length
    ORDER BY road_id, year DESC;
    

    If you need to return the actual covered from/to_meas (as in your question before edit), it might be more complicated. The first part is the same, but without aggregation the query can return adjacent rows with the same event_id (e.g. for event 3: 0-1 & 1-25):

    SELECT * FROM all_event_ranges
    WHERE rn = 1
    ORDER BY road_id, from_meas;
    

    If you want to merge adjacent rows you need two more steps (using a standard approach, flag the 1st row of a group and calculate a group number):

    WITH all_meas AS
     (
       SELECT road_id, from_meas AS meas
       FROM road_events
       UNION
       SELECT road_id, to_meas
       FROM road_events
     )
    -- select * from all_meas order by 1,2
     , all_ranges AS
     ( 
       SELECT road_id, meas AS from_meas 
         ,Lead(meas)
          Over (PARTITION BY road_id
                ORDER BY meas) AS to_meas
       FROM all_meas
      )
    -- SELECT * from all_ranges order by 1,2
    , all_event_ranges AS
     (
       SELECT 
          ar.*
         ,re.event_id
         ,re.year
         ,re.total_road_length
         ,ar.to_meas - ar.from_meas AS event_length
         ,Row_Number()
          Over (PARTITION BY ar.road_id, ar.from_meas
                ORDER BY year DESC) AS rn
       FROM all_ranges ar
       JOIN road_events  re
         ON ar.road_id = re.road_id
        AND ar.from_meas < re.to_meas
        AND ar.to_meas > re.from_meas
       WHERE ar.to_meas IS NOT NULL
     )
    -- SELECT * FROM all_event_ranges WHERE rn = 1 ORDER BY road_id, from_meas
    , adjacent_events AS 
     ( -- assign 1 to the 1st row of an event
       SELECT t.*
         ,CASE WHEN Lag(event_id)
                    Over(PARTITION BY road_id
                         ORDER BY from_meas) = event_id
               THEN 0 
               ELSE 1 
          END AS flag
       FROM all_event_ranges t
       WHERE rn = 1
     )
    -- SELECT * FROM adjacent_events ORDER BY road_id, from_meas 
    , grouped_events AS
     ( -- assign a groupnumber to adjacent rows using a Cumulative Sum over 0/1
       SELECT t.*
         ,Sum(flag)
          Over (PARTITION BY road_id
                ORDER BY from_meas
                ROWS Unbounded Preceding) AS grp
       FROM adjacent_events t
    )
    -- SELECT * FROM grouped_events ORDER BY  road_id, from_meas
    SELECT event_id, road_id, year, Min(from_meas), Max(to_meas), total_road_length, Sum(event_length)
    FROM grouped_events
    GROUP BY event_id, road_id, grp, year, total_road_length
    ORDER BY 2, Min(from_meas);
    

    Edit:

    Ups, I just found a blog Overlapping ranges with priority doing exactly the same with some simplified Oracle syntax. In fact I translated my query from a some other simplified syntax in Teradata to Standard/Oracle SQL :-)