I am writing a PostgreSQL trigger function that lets me do an ETL transformation on the data.
I have a table with the columns (id, price, EFF_FROM, EFF_TO). If I insert only (id, price, EFF_FROM), then EFF_TO needs to be calculated automatically as the next EFF_FROM of this id minus one day if such a row exists, and as '5999-12-31' if it does not.
To update the table I am using UPDATE ... FROM inside a trigger function, but it seems to update every EFF_TO in the table with the same value, derived from the last inserted EFF_FROM.
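For completeness, here is a minimal table definition matching the setup (the exact column types are not important; any date type for EFF_FROM and EFF_TO will do):

CREATE SCHEMA IF NOT EXISTS prices_schema;

-- Minimal table to reproduce the example (types chosen for illustration)
CREATE TABLE prices_schema.prices (
    id       integer,
    price    numeric,
    eff_from date,
    eff_to   date
);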
The trigger function that I use looks like this:
CREATE OR REPLACE FUNCTION prices_schema.prices_etl() RETURNS TRIGGER AS $$
BEGIN
    UPDATE prices_schema.prices
    SET    EFF_TO = subquery.next_eff_from
    FROM (
        SELECT COALESCE(
                   LEAD(EFF_FROM, 1) OVER (ORDER BY EFF_FROM ASC),
                   TO_DATE('6000-01-01', 'YYYY-MM-DD')  -- sentinel: minus 1 day gives 5999-12-31
               ) - 1 AS next_eff_from
        FROM   prices_schema.prices
    ) AS subquery;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE TRIGGER after_insert_prices
AFTER INSERT ON prices_schema.prices
FOR EACH ROW EXECUTE PROCEDURE prices_schema.prices_etl();
Load some test data:
INSERT INTO prices_schema.prices(id, Price, EFF_FROM)
VALUES(1, 100, '2017-01-12');
INSERT INTO prices_schema.prices(id, Price, EFF_FROM)
VALUES(1, 150, '2017-02-09');
INSERT INTO prices_schema.prices(id, Price, EFF_FROM)
VALUES(1, 125, '2017-01-27');
Execute:
SELECT * FROM prices_schema.prices
ORDER BY EFF_FROM ASC;
Result:
id | price | EFF_FROM | EFF_TO |
---|---|---|---|
1 | 100 | 2017-01-12 | 2017-01-26 |
1 | 125 | 2017-01-27 | 2017-01-26 |
1 | 150 | 2017-02-09 | 2017-01-26 |
EFF_TO obviously doesn't look as expected. What puzzles me even more is that running this query on its own:
SELECT eff_from, COALESCE(
           LEAD(eff_from, 1) OVER (ORDER BY eff_from ASC),
           TO_DATE('6000-01-01', 'YYYY-MM-DD')
       ) - 1 AS next_eff_from
FROM   prices_schema.prices;
gives me the correct result:
EFF_FROM | next_eff_from |
---|---|
2017-01-12 | 2017-01-26 |
2017-01-27 | 2017-02-08 |
2017-02-09 | 5999-12-31 |
I bet this has something to do with how triggers and trigger functions work, but I wasn't able to find the right explanation. I would appreciate being pointed in the right direction.
Turns out, I was only two lines away from a working version. The original UPDATE ... FROM had no condition joining the subquery back to the rows being updated, so every target row was paired with every subquery row and got one arbitrary next_eff_from. Adding PARTITION BY id to the window and a WHERE clause to the UPDATE fixes it:
CREATE OR REPLACE FUNCTION prices_schema.prices_etl() RETURNS TRIGGER AS $$
BEGIN
    UPDATE prices_schema.prices p0
    SET    eff_to = subquery.next_eff_from
    FROM (
        SELECT id, price, COALESCE(
                   LEAD(eff_from, 1) OVER (
                       PARTITION BY id           -- first added line: compute LEAD per id
                       ORDER BY eff_from ASC
                   ),
                   TO_DATE('6000-01-01', 'YYYY-MM-DD')  -- sentinel: minus 1 day gives 5999-12-31
               ) - 1 AS next_eff_from
        FROM   prices_schema.prices
    ) AS subquery
    -- second added line: join the subquery rows back to the rows being updated
    WHERE subquery.id = p0.id AND subquery.price = p0.price;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE TRIGGER after_insert_prices
AFTER INSERT ON prices_schema.prices
FOR EACH ROW EXECUTE PROCEDURE prices_schema.prices_etl();
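To make the old failure mode explicit: without the WHERE clause, UPDATE ... FROM joins the target table to the subquery with no condition, i.e. a cross join, and PostgreSQL applies one arbitrary matching row per target row. The candidate pairs it chooses from can be seen with a query like this (an illustration only, not part of the fix):

-- Illustration only: every target row is paired with every computed next_eff_from,
-- which is why one arbitrary value ended up in all EFF_TO columns before the fix.
SELECT p.id, p.eff_from, subquery.next_eff_from
FROM   prices_schema.prices p
CROSS JOIN (
    SELECT COALESCE(
               LEAD(eff_from, 1) OVER (ORDER BY eff_from ASC),
               TO_DATE('6000-01-01', 'YYYY-MM-DD')
           ) - 1 AS next_eff_from
    FROM   prices_schema.prices
) AS subquery
ORDER BY p.eff_from, subquery.next_eff_from;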
This now works as intended, even if multiple different id values are inserted. Execute:
INSERT INTO prices_schema.prices(id, Price, EFF_FROM)
VALUES(1, 100, '2017-01-12');
INSERT INTO prices_schema.prices(id, Price, EFF_FROM)
VALUES(1, 150, '2017-02-09');
INSERT INTO prices_schema.prices(id, Price, EFF_FROM)
VALUES(1, 125, '2017-01-27');
INSERT INTO prices_schema.prices(id, Price, EFF_FROM)
VALUES (2, 900, '2016-05-05');
SELECT * FROM prices_schema.prices;
Result:
id | price | eff_from | eff_to |
---|---|---|---|
1 | 100 | 2017-01-12 | 2017-01-26 |
1 | 150 | 2017-02-09 | 5999-12-31 |
1 | 125 | 2017-01-27 | 2017-02-08 |
2 | 900 | 2016-05-05 | 5999-12-31 |
Yes, I know it is probably still quite inefficient, but it works. It seems to work for bulk inserts too (a row-level AFTER trigger simply fires once per inserted row), but that wasn't my primary focus anyway, since I'm inserting the data via psycopg2 while parsing .json files.
What's more important is that it correctly updates two rows when two are affected by an insert. For example, when the new row with price = 125 is inserted, eff_to has to be updated both for that row and for the row with price = 100, since the insert lands between the existing time intervals.
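If the per-insert cost ever becomes a problem, one possible refinement (a sketch only, not what the code above does) would be to restrict the recalculation to the id that was just inserted by referencing NEW inside the trigger function:

-- Sketch of a narrower variant. Assumption: eff_from is unique per id, so it can serve
-- as the join key instead of price; only rows for NEW.id are recalculated.
CREATE OR REPLACE FUNCTION prices_schema.prices_etl() RETURNS TRIGGER AS $$
BEGIN
    UPDATE prices_schema.prices p0
    SET    eff_to = subquery.next_eff_from
    FROM (
        SELECT id, eff_from, COALESCE(
                   LEAD(eff_from, 1) OVER (PARTITION BY id ORDER BY eff_from ASC),
                   TO_DATE('6000-01-01', 'YYYY-MM-DD')
               ) - 1 AS next_eff_from
        FROM   prices_schema.prices
        WHERE  id = NEW.id              -- touch only the id that was just inserted
    ) AS subquery
    WHERE p0.id = NEW.id
      AND subquery.eff_from = p0.eff_from;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

Joining on eff_from instead of price would also avoid surprises if the same price ever appeared twice for one id.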