Search code examples
sqloracle-databasehierarchical-data

temporal joins in hierarchical query


I want to join various nodes of a tree, making sure the the returned root-to-leaf path is temporally valid. The tricky part is that the data source is dated with validity from-to dates.

ID NVALUE VFROM VTO
1 A 2021-01-01 2021-01-31
1 B 2021-02-01 2021-02-28
2 C 2021-01-01 2021-02-28
3 D 2021-01-01 2021-01-31
3 E 2021-02-01 2021-02-28

the links are trivially pointing to the node ids (but not their dates!)

LINK_CHILD LINK_PARENT
1 2
2 3

from this I would like to return the valid paths and their validity dates:

  1. A-C-D valid from 2021-01-01 to 2021-01-31
  2. B-C-E valid from 2021-02-01 to 2021-02-28

invalid paths (e.g. A-C-E should not be returned, since there is no moment in time in which all the three nodes are valid).

The issue I have with this is that the "overlap" check is not transitive (so A overlaps with B and B overlaps with C does not imply that A overlaps with C). So when writing the connect by query each level overlaps with the next, but the resulting global path is invalid.

the basic query set up I have is

with src_nodes (id, nvalue, vfrom, vto) as (
    select 1, 'A', date '2021-01-01', date '2021-01-31' from dual union all
    select 1, 'B', date '2021-02-01', date '2021-02-28' from dual union all
    select 2, 'C', date '2021-01-01', date '2021-02-28' from dual union all
    select 3, 'D', date '2021-01-01', date '2021-01-31' from dual union all
    select 3, 'E', date '2021-02-01', date '2021-02-28'
    from dual
),
     src_links(link_child, link_parent) as (
         select 1, 2 from dual union all
         select 2, 3 from dual
     ),
     full_links as (
         select c.*
         from src_links c
         union
         select null, link_child
         from src_links a
         where not exists(select null from src_links b where b.link_parent = a.link_child)
     ),
     nodes_and_links as (
         select *
         from full_links a
                  join src_nodes n on n.id = a.link_parent)
select *
from nodes_and_links nl
start with nl.link_child is null
connect by prior nl.link_parent = nl.link_child and
           greatest(prior nl.vfrom, nl.vfrom) < 
           least(prior nl.vto, nl.vto)

Solution

  • I believe this is the most efficient way. One recursive with query and another trivial query to get the leaves.

    here's an example with a more complex data source:

    dbfiddle

    with src_nodes as (
        select 1 id, 'A' nvalue, date '2021-01-01' vfrom, date '2021-02-10' vto
        from dual
        union all
        select 1, 'B', date '2021-02-15', date '2021-02-28'
        from dual
        union all
        select 2, 'C', date '2021-01-01', date '2021-01-31'
        from dual
        union all
        select 2, 'D', date '2021-02-01', date '2021-02-28'
        from dual
        union all
        select 3, 'E', date '2021-01-01', date '2021-02-28'
        from dual
        union all
        select 4, 'F', date '2021-01-01', date '2021-01-31'
        from dual
        union all
        select 4, 'G', date '2021-02-01', date '2021-02-28'
        from dual
        union all
        select 5, 'H', date '2021-02-01', date '2021-02-28'
        from dual
        union all
        select 6, 'I', date '2021-02-10', date '2021-02-28'
        from dual
    
    ),
         src_links as (
             select 1 link_child, 2 link_parent
             from dual
             union all
             select 2, 3
             from dual
             union all
             select 3, 4
             from dual
             union all
             select 5, 6
             from dual
         ),
         -- use "recursive with" method instead of "connect by" to be able to
         -- refine the validity dates as we walk the tree
         hier (id, vfrom, vto, nvalue, lvl, root_id, tpath) as (
             select sn.id, sn.vfrom, sn.vto, sn.nvalue, 1 lvl, sn.id, sn.nvalue || ''
             from src_nodes sn
             where -- start with nodes that have no incoming parent link
                 exists(select null from src_links a where a.link_child = sn.id)
               and not exists(select null from src_links a where a.link_parent = sn.id)
             union all
             select sn.id,
                    greatest(sn.vfrom, hier.vfrom),
                    least(sn.vto, hier.vto),
                    sn.nvalue,
                    hier.lvl + 1 lvl,
                    hier.root_id,
                    hier.tpath || '-' || sn.nvalue
             from hier
                      join src_links ln on ln.link_child = hier.id
                      join src_nodes sn on sn.id = ln.link_parent --
                 and greatest(sn.vfrom, hier.vfrom) < least(sn.vto, hier.vto)
         ) -- use "depth first" to be able to detect leaf nodes
             search depth first by id set seq,
         hier_leaves as (
             select *
             from (
                      select a.*,
                             -- a difference of one means it's a normal 'depth first' step. otherwise it's a leaf
                             (case lead(a.lvl) over (order by a.seq) - a.lvl
                                  when 1 then 'inner'
                                  else 'leaf' end) path_type
                      from hier a)
             where path_type = 'leaf')
    select hl.tpath, hl.vfrom, hl.vto
    from hier_leaves hl;
    

    I have now tested this approach against our data which have 300K nodes and 240K links, and the trees (plus some additional pivoting) are parsed in 6 seconds. Similar work was done in 10 minutes by the ETLs.