I'm faced to a very generic issue in our Postgresql database. Many tables contain entries that are only valid for a period of time, for instance the details of contracts that are likely to evolve over time.
In order to deal with it, two fields, valid from
and valid to
, are provided to indicate the validity period of the row content. Each time a contract changes, a row is added to the table with the actualized information and the corresponding valid dates.
The main deal arises when performing joins on tables with overlapping validity fields. More precisely, given a first table:
fg valid_from valid_to attr_table1
key1 2020-01-01 2020-01-18 A
key1 2020-01-19 null B
key2 2020-01-01 2020-01-30 A
key2 2020-01-30 null B
and a second table
fg valid_from valid_to attr_table2
key1 2020-01-01 2020-01-10 1.0
key1 2020-01-10 null 3.0
key2 2020-01-01 2020-01-30 10.0
key2 2020-01-30 null 11.0
I want to build a joined table with a validity field that embedds the validity periods of both tables, such as:
fg valid_from valid_to attr_table1 attr_table2
key1 2020-01-01 2020-01-10 A 1.0
key1 2020-01-10 2020-01-18 A 3.0
key1 2020-01-18 null B 3.0
key2 2020-01-01 2020-01-30 A 10.0
key2 2020-01-30 null B 11.0
Until now, my most convincing attempt was to switch to the Postgresql-specific type daterange
, and use the &&
operator ("have common points"). I concatenated both valid from
and valid to
fields into a validity
field, and next query almost seems to do the job:
select t1.fg,
(case when upper(t1.validity) is null
then case when (upper(t2.validity) is null)
then case when lower(t1.validity) > lower(t2.validity)
then daterange(lower(t1.validity), null)
else daterange(lower(t2.validity), null)
end
else case when lower(t1.validity) > lower(t2.validity)
then daterange(lower(t1.validity), upper(t2.validity))
else daterange(lower(t2.validity), upper(t2.validity))
end
end
when upper(t2.validity) is null
then case when (upper(t1.validity) is null)
then case when lower(t1.validity) > lower(t2.validity)
then daterange(lower(t1.validity), null)
else daterange(lower(t2.validity), null)
end
else case when lower(t1.validity) > lower(t2.validity)
then daterange(lower(t1.validity), upper(t1.validity))
else daterange(lower(t2.validity), upper(t1.validity))
end
end
when lower(t1.validity) <= lower(t2.validity)
then case when upper(t1.validity) >= upper(t2.validity)
then daterange(lower(t2.validity), upper(t2.validity))
else daterange(lower(t2.validity), upper(t1.validity))
end
else case when upper(t1.validity) >= upper(t2.validity)
then daterange(lower(t1.validity), upper(t2.validity))
else daterange(lower(t1.validity), upper(t1.validity))
end
end
) as validity,
t1.attr_table1,
t2.attr_table2
from table1 as t1
join table2 as t2
on t1.fg = t2.fg
and t1.validity && t2.validity
order by fg, validity
However, this query fails when the starting point of the first table does not match any entry in the second one. For instance, with an additional row in the first and second tables like
In table 1:
key1 2019-12-25 2020-01-01 A
In table 2:
key1 2019-12-27 2020-01-01 -1
The resulting first row of the output table is
key1 2019-12-27 2020-01-01 A -1
instead of
key1 2019-12-25 2019-12-27 A null
key1 2019-12-27 2020-01-01 A -1
Is anyone aware of a better approach ?
EDIT: The code to create table1 and table2, relying on daterange:
create table table1
(
fg text,
validity daterange,
attr_table1 text
);
insert into table1
values
('key1', daterange('2020-01-01', '2020-01-18'), 'A'),
('key1', daterange('2020-01-18', null ), 'B'),
('key2', daterange('2020-01-01', '2020-01-30'), 'A'),
('key2', daterange('2020-01-30', null ), 'B')
and
create table table2
(
fg text,
validity daterange,
attr_table2 text
);
insert into table2
values
('key1', daterange('2020-01-01', '2020-01-10'), 1.0),
('key1', daterange('2020-01-10', null ), 3.0),
('key2', daterange('2020-01-01', '2020-01-30'), 10.0),
('key2', daterange('2020-01-30', null ), 11.0)
[UPDATE]
fg
valid_to
NULLs to infinity
]create table table1
(
fg text,
validity daterange,
attr_table1 text
);
insert into table1
values
('key1', daterange('2019-12-25', '2020-01-01'), 'A'), -- NEW
('key1', daterange('2020-01-01', '2020-01-18'), 'A'),
('key1', daterange('2020-01-19', 'infinity' ), 'B'),
('key2', daterange('2020-01-01', '2020-01-30'), 'A'),
('key2', daterange('2020-01-30', 'infinity' ), 'B');
create table table2
(
fg text,
validity daterange,
attr_table2 text
);
insert into table2
values
('key1', daterange('2019-12-27', '2020-01-01'), -1 ), -- NEW
('key1', daterange('2020-01-01', '2020-01-10'), 1.0),
('key1', daterange('2020-01-10', 'infinity' ), 3.0),
('key2', daterange('2020-01-01', '2020-01-30'), 10.0),
('key2', daterange('2020-01-30', 'infinity' ), 11.0);
-- Make a 'CALENDAR' table with all points in time (per fg)
-- ---------------------------------------------------------
WITH pits AS (
select distinct fg, lower(validity) as pit FROM table1
UNION
select distinct fg, upper(validity) as pit FROM table1
UNION
select distinct fg, lower(validity) as pit FROM table2
UNION
select distinct fg, upper(validity) as pit FROM table2
)
-- combine all adjacent PITs to ranges
-- ---------------------------------------
, pairs AS (
SELECT fg, pit AS opit
, lead(pit) OVER (PARTITION BY fg ORDER BY pit) AS npit
from pits
)
-- Make dateranges from them
-- --------------------------
, tablex AS (
SELECT fg
, daterange(opit,npit) AS validity
FROM pairs
WHERE npit IS NOT NULL
-- ORDER BY 1,2;
)
-- Left join both table1 and table2 to all_rages
-- ----------------------------------------------
SELECT tx.fg
, tx.validity
, t1.validity * t2.validity AS overlapped
, t1.attr_table1
, t2.attr_table2
FROM tablex tx
LEFT JOIN table1 t1 ON t1.fg = tx.fg AND t1.validity && tx.validity
LEFT JOIN table2 t2 ON t2.fg = tx.fg AND t2.validity && tx.validity
ORDER BY 1,2
;
Results:
DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
INSERT 0 5
CREATE TABLE
INSERT 0 5
fg | validity | overlapped | attr_table1 | attr_table2
------+-------------------------+-------------------------+-------------+-------------
key1 | [2019-12-25,2019-12-27) | | A |
key1 | [2019-12-27,2020-01-01) | [2019-12-27,2020-01-01) | A | -1
key1 | [2020-01-01,2020-01-10) | [2020-01-01,2020-01-10) | A | 1.0
key1 | [2020-01-10,2020-01-18) | [2020-01-10,2020-01-18) | A | 3.0
key1 | [2020-01-18,2020-01-19) | | | 3.0
key1 | [2020-01-19,infinity) | [2020-01-19,infinity) | B | 3.0
key2 | [2020-01-01,2020-01-30) | [2020-01-01,2020-01-30) | A | 10.0
key2 | [2020-01-30,infinity) | [2020-01-30,infinity) | B | 11.0
(8 rows)