I have a standing table of historical transactions and I would like to insert records from a new table where the status of the new record is different from the old at the time that the new record was "valid" (based on a valid_from
column in both tables).
For example, take the standing/transactional table (I'll call it old
):
CREATE TABLE old (id serial, key text, valid_from date, status text);
INSERT INTO old (key, valid_from, status)
VALUES
('A', '2014-01-01', 'x'),
('A', '2014-02-01', 'y'),
('A', '2014-03-01', 'z'),
('B', '2014-01-10', 'x'),
('B', '2014-02-15', 'y');
SELECT * FROM old;
id | key | valid_from | status
----+-----+------------+--------
1 | A | 2014-01-01 | x
2 | A | 2014-02-01 | y
3 | A | 2014-03-01 | z
4 | B | 2014-01-10 | x
5 | B | 2014-02-15 | y
And the new
table of updates:
CREATE TABLE new (id serial, key text, valid_from date, status text);
INSERT INTO new (key, valid_from, status)
VALUES
('A', '2014-01-15', 'x'),
('A', '2014-02-15', 'x'),
('A', '2014-03-15', 'z'),
('B', '2014-02-15', 'y');
SELECT * FROM new;
id | key | valid_from | status
----+-----+------------+--------
1 | A | 2014-01-15 | x
2 | A | 2014-02-15 | x
3 | A | 2014-03-15 | z
4 | B | 2014-02-15 | y
Ideally, I would like to insert only row 2 of the new table because it has a different status than record with the same key
in the old
table with the most recent valid_from
date prior to the new record. Specifically:
id
= 1 in the new table should not be inserted. The maximum valid_from
date in the old
table for key A
('2014-01-01') prior to the new valid_from
date ('2014-01-15'); i.e., they both have a status of x
.id
= 2 in the new table should be inserted because its status
(x
) is different than the status
in the old
table (y
) where the valid_from
date in the old table was equal to the maximum valid_from
for that key that is prior to the valid_from
in the new table.id
= 3 should not be inserted because it shares the same status (x
) as the record in the old table for the most recent valid_from
id
= 4 should not be inserted because, again, it's status (y
) is identical to the corresponding record in the old
table.After the insert, the old
table should look like:
SELECT * FROM old ORDER BY key, valid_from;
id | key | valid_from | status
----+-----+------------+--------
1 | A | 2014-01-01 | x
2 | A | 2014-02-01 | y
6 | A | 2014-02-15 | x
3 | A | 2014-03-01 | z
4 | B | 2014-01-10 | x
5 | B | 2014-02-15 | y
The goal: there should never be two successive records with the same key
and status
when sorted by ascending valid_from
.
If the idea were to simply insert records with a key
+ status
combination that did not exist in the old
table, I would do something like this:
INSERT INTO old (key, valid_from, status)
SELECT new.key, new.valid_from, new.status
FROM new LEFT JOIN old USING(key, status)
WHERE old.key IS NULL;
However, this misses id
2 in the new table since id
1 in the old table shares the same status, even though there is a more recent record in the old
table that does indeed differ.
Whatever I try, I can't seem to check the new
record against only the most "recent" old
record with a valid_from
< the valid_from
in the new record. I have a feeling I can do it with a combination of a window function and/or a DISTINCT ON to find the maximum valid_from, but I can't quite seem to put the two together. For instance, I've tried things like:
INSERT INTO old(key, valid_from, status)
SELECT n.key, n.valid_from, n.status
FROM new n LEFT JOIN (
SELECT DISTINCT ON (old.key) old.*
FROM new
JOIN old
ON old.key = new.key
AND old.valid_from < new.valid_from
ORDER BY old.key, old.valid_from DESC) o
ON
n.key = o.key
AND n.status = o.status
WHERE
o.key IS NULL;
However, this winds up inserting the first record from the new
table:
id | key | valid_from | status
----+-----+------------+--------
1 | A | 2014-01-01 | x
6 | A | 2014-01-15 | x
2 | A | 2014-02-01 | y
7 | A | 2014-02-15 | x
3 | A | 2014-03-01 | z
4 | B | 2014-01-10 | x
8 | B | 2014-02-15 | y
5 | B | 2014-02-15 | y
I'm using PostgreSQL 9.3. Any pointers on an approach to this problem? Thanks in advance!
After some more toying, I think I have a solution (although probably not the best one). The logical sequence of steps would be:
valid_from
is greater than the old record's valid_from
valid_from
is the most recent valid_from
among all candidate matches (we'll call it is_most_recent
). I did this with a window function (which can't be used in a WHERE clause). status
does not equal the old status
Here's the best I've come up with so far:
INSERT INTO old (key, valid_from, status)
SELECT key, valid_from, status FROM (
SELECT
new.key,
new.valid_from,
new.status,
-- Does the new record have a different statusf from the old?
old.status != new.status AS is_different_status,
-- Are we comparing the most "recent" record in the old table?
old.valid_from = MAX(old.valid_from) OVER (PARTITION BY old.key, new.id)
AS is_most_recent
FROM new JOIN old
ON new.key = old.key
AND new.valid_from >= old.valid_from) AS to_insert
WHERE is_different_status AND is_most_recent;
I'd love to know if there's a better solution out there.