Search code examples
sqlpostgresqljoindate-range

SQL: join tables with valid from / valid to fields


I'm faced to a very generic issue in our Postgresql database. Many tables contain entries that are only valid for a period of time, for instance the details of contracts that are likely to evolve over time.

In order to deal with it, two fields, valid from and valid to, are provided to indicate the validity period of the row content. Each time a contract changes, a row is added to the table with the actualized information and the corresponding valid dates.

The main deal arises when performing joins on tables with overlapping validity fields. More precisely, given a first table:

fg     valid_from    valid_to    attr_table1
key1   2020-01-01   2020-01-18        A
key1   2020-01-19   null              B
key2   2020-01-01   2020-01-30        A
key2   2020-01-30   null              B

and a second table

fg     valid_from    valid_to    attr_table2
key1   2020-01-01   2020-01-10       1.0
key1   2020-01-10   null             3.0
key2   2020-01-01   2020-01-30      10.0
key2   2020-01-30   null            11.0

I want to build a joined table with a validity field that embedds the validity periods of both tables, such as:

fg     valid_from   valid_to    attr_table1  attr_table2
key1   2020-01-01   2020-01-10        A         1.0
key1   2020-01-10   2020-01-18        A         3.0
key1   2020-01-18   null              B         3.0
key2   2020-01-01   2020-01-30        A         10.0
key2   2020-01-30   null              B         11.0

Until now, my most convincing attempt was to switch to the Postgresql-specific type daterange, and use the && operator ("have common points"). I concatenated both valid from and valid to fields into a validity field, and next query almost seems to do the job:

select t1.fg,
       (case when upper(t1.validity) is null
             then case when (upper(t2.validity) is null) 
                       then case when lower(t1.validity) > lower(t2.validity) 
                                 then daterange(lower(t1.validity), null)
                                 else daterange(lower(t2.validity), null)
                                 end
                       else case when lower(t1.validity) > lower(t2.validity) 
                                 then daterange(lower(t1.validity), upper(t2.validity)) 
                                 else daterange(lower(t2.validity), upper(t2.validity)) 
                                 end
                       end
             when upper(t2.validity) is null
             then case when (upper(t1.validity) is null) 
                       then case when lower(t1.validity) > lower(t2.validity) 
                                 then daterange(lower(t1.validity), null)
                                 else daterange(lower(t2.validity), null)
                                 end
                       else case when lower(t1.validity) > lower(t2.validity) 
                                 then daterange(lower(t1.validity), upper(t1.validity)) 
                                 else daterange(lower(t2.validity), upper(t1.validity)) 
                                 end
                       end
             when lower(t1.validity) <= lower(t2.validity)
             then case when upper(t1.validity) >= upper(t2.validity) 
                       then daterange(lower(t2.validity), upper(t2.validity))
                       else daterange(lower(t2.validity), upper(t1.validity))
                       end
             else case when upper(t1.validity) >= upper(t2.validity) 
                       then daterange(lower(t1.validity), upper(t2.validity))
                       else daterange(lower(t1.validity), upper(t1.validity))
                       end
             end
            ) as validity,
       t1.attr_table1, 
       t2.attr_table2
  from table1 as t1 
       join table2 as t2
         on t1.fg = t2.fg
        and t1.validity && t2.validity
order by fg, validity

However, this query fails when the starting point of the first table does not match any entry in the second one. For instance, with an additional row in the first and second tables like

In table 1:

key1 2019-12-25 2020-01-01 A

In table 2:

key1 2019-12-27 2020-01-01 -1

The resulting first row of the output table is

key1 2019-12-27 2020-01-01 A -1

instead of

key1   2019-12-25   2019-12-27        A    null
key1   2019-12-27   2020-01-01        A    -1 

Is anyone aware of a better approach ?

EDIT: The code to create table1 and table2, relying on daterange:

create table table1
( 
  fg text, 
  validity daterange, 
  attr_table1 text
);
insert into table1 
values
('key1', daterange('2020-01-01', '2020-01-18'),  'A'),
('key1', daterange('2020-01-18', null        ),  'B'),
('key2', daterange('2020-01-01', '2020-01-30'),  'A'),
('key2', daterange('2020-01-30', null        ),  'B')

and

create table table2
( 
  fg text, 
  validity daterange,  
  attr_table2 text
);
insert into table2 
values
('key1', daterange('2020-01-01', '2020-01-10'),   1.0),
('key1', daterange('2020-01-10', null        ),   3.0),
('key2', daterange('2020-01-01', '2020-01-30'),  10.0),
('key2', daterange('2020-01-30', null        ),  11.0)

Solution

  • [UPDATE]

    • make a calendar table of all existing timespans per fg
    • LEFT join both table1 and table2 to this table
    • [for ease of comparison I changed the valid_to NULLs to infinity]

    create table table1
    (
      fg text,
      validity daterange,
      attr_table1 text
    );
    
    insert into table1
    values
    ('key1', daterange('2019-12-25', '2020-01-01'),  'A'), -- NEW
    ('key1', daterange('2020-01-01', '2020-01-18'),  'A'),
    ('key1', daterange('2020-01-19', 'infinity'        ),  'B'),
    ('key2', daterange('2020-01-01', '2020-01-30'),  'A'),
    ('key2', daterange('2020-01-30', 'infinity'        ),  'B');
    
    create table table2
    (
      fg text,
      validity daterange,
      attr_table2 text
    );
    insert into table2
    values
    ('key1', daterange('2019-12-27', '2020-01-01'),  -1  ), -- NEW
    ('key1', daterange('2020-01-01', '2020-01-10'),   1.0),
    ('key1', daterange('2020-01-10', 'infinity'        ),   3.0),
    ('key2', daterange('2020-01-01', '2020-01-30'),  10.0),
    ('key2', daterange('2020-01-30', 'infinity'        ),  11.0);
    
    
            -- Make a 'CALENDAR' table with all points in time (per fg)
            -- ---------------------------------------------------------
    WITH pits AS (
            select distinct fg, lower(validity) as pit FROM table1
    UNION
            select distinct fg, upper(validity) as pit FROM table1
    UNION
            select distinct fg, lower(validity) as pit FROM table2
    UNION
            select distinct fg, upper(validity) as pit FROM table2
            )
            -- combine all adjacent PITs to ranges
            -- ---------------------------------------
    , pairs AS (
            SELECT fg, pit AS opit
            , lead(pit) OVER (PARTITION BY fg ORDER BY pit) AS npit
            from pits
            )
            -- Make dateranges from them
            -- --------------------------
    , tablex AS (
            SELECT fg
            , daterange(opit,npit) AS validity
            FROM pairs
            WHERE npit IS NOT NULL
            -- ORDER BY 1,2;
            )
            -- Left join both table1 and table2 to all_rages
            -- ----------------------------------------------
    SELECT tx.fg
            , tx.validity
            , t1.validity * t2.validity AS overlapped
            , t1.attr_table1
            , t2.attr_table2
    FROM tablex tx
    LEFT JOIN table1 t1 ON t1.fg = tx.fg AND t1.validity && tx.validity
    LEFT JOIN table2 t2 ON t2.fg = tx.fg AND t2.validity && tx.validity
    ORDER BY 1,2
            ;
    

    Results:


    DROP SCHEMA
    CREATE SCHEMA
    SET
    CREATE TABLE
    INSERT 0 5
    CREATE TABLE
    INSERT 0 5
      fg  |        validity         |       overlapped        | attr_table1 | attr_table2 
    ------+-------------------------+-------------------------+-------------+-------------
     key1 | [2019-12-25,2019-12-27) |                         | A           | 
     key1 | [2019-12-27,2020-01-01) | [2019-12-27,2020-01-01) | A           | -1
     key1 | [2020-01-01,2020-01-10) | [2020-01-01,2020-01-10) | A           | 1.0
     key1 | [2020-01-10,2020-01-18) | [2020-01-10,2020-01-18) | A           | 3.0
     key1 | [2020-01-18,2020-01-19) |                         |             | 3.0
     key1 | [2020-01-19,infinity)   | [2020-01-19,infinity)   | B           | 3.0
     key2 | [2020-01-01,2020-01-30) | [2020-01-01,2020-01-30) | A           | 10.0
     key2 | [2020-01-30,infinity)   | [2020-01-30,infinity)   | B           | 11.0
    (8 rows)