I want to join various nodes of a tree, making sure the the returned root-to-leaf path is temporally valid. The tricky part is that the data source is dated with validity from-to dates.
ID | NVALUE | VFROM | VTO |
---|---|---|---|
1 | A | 2021-01-01 | 2021-01-31 |
1 | B | 2021-02-01 | 2021-02-28 |
2 | C | 2021-01-01 | 2021-02-28 |
3 | D | 2021-01-01 | 2021-01-31 |
3 | E | 2021-02-01 | 2021-02-28 |
the links are trivially pointing to the node ids (but not their dates!)
LINK_CHILD | LINK_PARENT |
---|---|
1 | 2 |
2 | 3 |
from this I would like to return the valid paths and their validity dates:
A-C-D
valid from 2021-01-01
to 2021-01-31
B-C-E
valid from 2021-02-01
to 2021-02-28
invalid paths (e.g. A-C-E
should not be returned, since there is no moment in time in which all the three nodes are valid).
The issue I have with this is that the "overlap" check is not transitive (so A overlaps with B and B overlaps with C does not imply that A overlaps with C). So when writing the connect by
query each level overlaps with the next, but the resulting global path is invalid.
the basic query set up I have is
with src_nodes (id, nvalue, vfrom, vto) as (
select 1, 'A', date '2021-01-01', date '2021-01-31' from dual union all
select 1, 'B', date '2021-02-01', date '2021-02-28' from dual union all
select 2, 'C', date '2021-01-01', date '2021-02-28' from dual union all
select 3, 'D', date '2021-01-01', date '2021-01-31' from dual union all
select 3, 'E', date '2021-02-01', date '2021-02-28'
from dual
),
src_links(link_child, link_parent) as (
select 1, 2 from dual union all
select 2, 3 from dual
),
full_links as (
select c.*
from src_links c
union
select null, link_child
from src_links a
where not exists(select null from src_links b where b.link_parent = a.link_child)
),
nodes_and_links as (
select *
from full_links a
join src_nodes n on n.id = a.link_parent)
select *
from nodes_and_links nl
start with nl.link_child is null
connect by prior nl.link_parent = nl.link_child and
greatest(prior nl.vfrom, nl.vfrom) <
least(prior nl.vto, nl.vto)
I believe this is the most efficient way. One recursive with
query and another trivial query to get the leaves.
here's an example with a more complex data source:
with src_nodes as (
select 1 id, 'A' nvalue, date '2021-01-01' vfrom, date '2021-02-10' vto
from dual
union all
select 1, 'B', date '2021-02-15', date '2021-02-28'
from dual
union all
select 2, 'C', date '2021-01-01', date '2021-01-31'
from dual
union all
select 2, 'D', date '2021-02-01', date '2021-02-28'
from dual
union all
select 3, 'E', date '2021-01-01', date '2021-02-28'
from dual
union all
select 4, 'F', date '2021-01-01', date '2021-01-31'
from dual
union all
select 4, 'G', date '2021-02-01', date '2021-02-28'
from dual
union all
select 5, 'H', date '2021-02-01', date '2021-02-28'
from dual
union all
select 6, 'I', date '2021-02-10', date '2021-02-28'
from dual
),
src_links as (
select 1 link_child, 2 link_parent
from dual
union all
select 2, 3
from dual
union all
select 3, 4
from dual
union all
select 5, 6
from dual
),
-- use "recursive with" method instead of "connect by" to be able to
-- refine the validity dates as we walk the tree
hier (id, vfrom, vto, nvalue, lvl, root_id, tpath) as (
select sn.id, sn.vfrom, sn.vto, sn.nvalue, 1 lvl, sn.id, sn.nvalue || ''
from src_nodes sn
where -- start with nodes that have no incoming parent link
exists(select null from src_links a where a.link_child = sn.id)
and not exists(select null from src_links a where a.link_parent = sn.id)
union all
select sn.id,
greatest(sn.vfrom, hier.vfrom),
least(sn.vto, hier.vto),
sn.nvalue,
hier.lvl + 1 lvl,
hier.root_id,
hier.tpath || '-' || sn.nvalue
from hier
join src_links ln on ln.link_child = hier.id
join src_nodes sn on sn.id = ln.link_parent --
and greatest(sn.vfrom, hier.vfrom) < least(sn.vto, hier.vto)
) -- use "depth first" to be able to detect leaf nodes
search depth first by id set seq,
hier_leaves as (
select *
from (
select a.*,
-- a difference of one means it's a normal 'depth first' step. otherwise it's a leaf
(case lead(a.lvl) over (order by a.seq) - a.lvl
when 1 then 'inner'
else 'leaf' end) path_type
from hier a)
where path_type = 'leaf')
select hl.tpath, hl.vfrom, hl.vto
from hier_leaves hl;
I have now tested this approach against our data which have 300K nodes and 240K links, and the trees (plus some additional pivoting) are parsed in 6 seconds. Similar work was done in 10 minutes by the ETLs.