I have a Postgres 11 table like so:
CREATE TABLE schema.foo_numbers (
id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
created_at TIMESTAMP WITH TIME ZONE DEFAULT now(),
quantity INTEGER,
category TEXT
);
It has some data like:
id | created_at | quantity | category
----+------------------------+----------+----------
1 | 2020-01-01 12:00:00+00 | 2 | a
2 | 2020-01-02 17:00:00+00 | 1 | b
3 | 2020-01-01 15:00:00+00 | 6 | a
4 | 2020-01-04 09:00:00+00 | 1 | b
5 | 2020-01-05 19:00:00+00 | 2 | a
6 | 2020-01-06 23:00:00+00 | 8 | b
7 | 2020-01-07 20:00:00+00 | 1 | a
8 | 2020-01-08 04:00:00+00 | 2 | b
9 | 2020-01-09 23:00:00+00 | 1 | a
10 | 2020-01-10 19:00:00+00 | 1 | b
11 | 2020-01-11 05:00:00+00 | 1 | a
12 | 2020-01-12 21:00:00+00 | 1 | b
13 | 2020-01-13 01:00:00+00 | 1 | a
14 | 2020-01-14 18:00:00+00 | 1 | b
I have another table which tracks certain properties of foo categories:
create table schema.foo_category_properties (
id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
category TEXT NOT NULL,
some_bool BOOLEAN NOT NULL DEFAULT FALSE
);
This table has data like so:
id | category | some_bool
----+----------+-----------
1 | a | f
2 | b | f
I need to create a Postgres function (to be called from app logic via the PostgREST API) that will, for an argument $TIMESTAMP, return the most recent record for each category with created_at <= $TIMESTAMP.
Ideally, the incoming argument will be treated as TIMESTAMP WITH TIME ZONE AT TIME ZONE 'America/Los_Angeles', and the function returns the latest record with its timestamp shown in that same time zone. However, it's also OK if that's not possible and all timestamps remain in UTC [to be offset in app logic], provided the correct data is returned in a consistent fashion.
Server time is set to UTC:
psql => show time zone;
TimeZone
----------
UTC
(1 row)
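As I understand it, AT TIME ZONE behaves like this with the server at UTC (using the January offset of UTC-8 for America/Los_Angeles): applied to a timestamptz it returns the Los Angeles wall-clock time as a plain timestamp, and applied to a plain timestamp it interprets that wall-clock value as Los Angeles local time and returns a timestamptz:
select timestamptz '2020-01-07 20:00:00+00' at time zone 'America/Los_Angeles';
-- 2020-01-07 12:00:00  (timestamp without time zone)
select timestamp '2020-01-07 19:00:00' at time zone 'America/Los_Angeles';
-- 2020-01-08 03:00:00+00  (timestamptz, displayed in UTC)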
The Postgres function I've written is like so:
CREATE OR REPLACE FUNCTION schema.foo_proc (end_date TEXT)
RETURNS TABLE (
id INTEGER,
category TEXT,
quantity BIGINT,
snapshot_count NUMERIC,
latest_entry TIMESTAMP WITH TIME ZONE
)
AS $$
#variable_conflict use_column
BEGIN
RETURN QUERY
SELECT
alias1.id,
alias1.category,
alias1.quantity::BIGINT AS quantity,
alias1.snapshot_count::NUMERIC AS snapshot_count,
alias2.latest_entry AS latest_entry
FROM
(
SELECT
id,
category,
quantity,
sum(quantity) OVER (partition by category ORDER BY created_at) AS snapshot_count
FROM
schema.foo_numbers
) AS alias1
INNER JOIN
(
SELECT
max(id) AS id,
category,
max(created_at) AS latest_entry
from
schema.foo_numbers
WHERE created_at AT TIME ZONE 'America/Los_Angeles' <= to_timestamp($1, 'YYYY-MM-DD HH24:MI:SS') :: TIMESTAMPTZ AT TIME ZONE 'America/Los_Angeles'
group by category
order by category
) AS alias2
ON
alias1.id = alias2.id
INNER JOIN
schema.foo_category_properties fcp
ON
alias2.category = fcp.category
WHERE fcp.some_bool IS FALSE
ORDER BY
alias1.category
;
END;
$$ LANGUAGE plpgsql;
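For reference, the function can be called directly from psql with the timestamp passed as text, for example:
select * from schema.foo_proc('2020-01-07 19:00:00');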
Here is the data in foo_numbers with the timestamps shifted to time zone 'America/Los_Angeles':
psql=> select id, created_at at time zone 'America/Los_Angeles', quantity, category from schema.foo_numbers order by created_at;
id | timezone | quantity | category
----+---------------------+----------+----------
1 | 2020-01-01 04:00:00 | 2 | a
3 | 2020-01-01 07:00:00 | 6 | a
2 | 2020-01-02 09:00:00 | 1 | b
4 | 2020-01-04 01:00:00 | 1 | b
5 | 2020-01-05 11:00:00 | 2 | a
6 | 2020-01-06 15:00:00 | 8 | b
7 | 2020-01-07 12:00:00 | 1 | a
8 | 2020-01-07 20:00:00 | 2 | b
9 | 2020-01-09 15:00:00 | 1 | a
10 | 2020-01-10 11:00:00 | 1 | b
11 | 2020-01-10 21:00:00 | 1 | a
12 | 2020-01-12 13:00:00 | 1 | b
13 | 2020-01-12 17:00:00 | 1 | a
14 | 2020-01-14 10:00:00 | 1 | b
(14 rows)
The expected output for argument:
"end_date":"2020-01-07 19:00:00"
would be
id | category | quantity | snapshot_count | latest_entry
----+----------+----------+----------------+------------------------
6 | b | 8 | 10 | 2020-01-06 15:00:00
7 | a | 1 | 11 | 2020-01-07 12:00:00
(2 rows)
However, actual output for the same argument is :
id | category | quantity | snapshot_count | latest_entry
----+----------+----------+----------------+------------------------
5 | a | 2 | 10 | 2020-01-05 19:00:00+00
6 | b | 8 | 10 | 2020-01-06 23:00:00+00
(2 rows)
Similar unexpected results happen with the argument cast to timestamptz at UTC.
In all variations I’ve tried, the returned rows don’t correctly match the argument boundary.
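For what it's worth, the boundary expression in my WHERE clause evaluates to:
select to_timestamp('2020-01-07 19:00:00', 'YYYY-MM-DD HH24:MI:SS') :: TIMESTAMPTZ AT TIME ZONE 'America/Los_Angeles';
-- 2020-01-07 11:00:00
which looks like the argument is being read as UTC and then shifted to Los Angeles, and would explain why id 5 (2020-01-05 11:00 Los Angeles time) is the latest category-a row returned.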
Clearly, I'm failing to understand something about how time zones are handled in Postgres. I have read the official docs in detail, a fair few related questions on SO, and the PG forum threads where to_timestamp() is discussed, but haven't been able to get the correct results after considerable trial and error.
All guidance is much appreciated!
You can use distinct on and proper time zone translation:
select distinct on (n.category)
n.id,
n.created_at at time zone 'America/Los_Angeles' at time zone 'utc' created_at,
n.quantity,
n.category,
sum(quantity)
over (partition by n.category order by n.created_at) as snapshot_count
from foo_numbers n
inner join foo_category_properties cp on cp.category = n.category
where n.created_at <= '2020-01-07 19:00:00'::timestamp with time zone
at time zone 'utc' at time zone 'America/Los_Angeles'
order by n.category, n.created_at desc
 id | created_at             | quantity | category | snapshot_count
----+------------------------+----------+----------+----------------
  7 | 2020-01-07 12:00:00+00 |        1 | a        |             11
  6 | 2020-01-06 15:00:00+00 |        8 | b        |             10
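If you still need this wrapped in a function for PostgREST, a minimal sketch based on the query above could look like the following (foo_latest is just an example name; it keeps the argument as text like your original, casts it directly to a plain timestamp before interpreting it as Los Angeles local time, and returns the Los Angeles wall-clock time as a plain timestamp):
create or replace function schema.foo_latest(end_date text)
returns table (
  id integer,
  created_at timestamp,
  quantity integer,
  category text,
  snapshot_count bigint
)
as $$
  select distinct on (n.category)
    n.id,
    -- display each instant as Los Angeles wall-clock time
    n.created_at at time zone 'America/Los_Angeles' as created_at,
    n.quantity,
    n.category,
    sum(n.quantity) over (partition by n.category order by n.created_at) as snapshot_count
  from schema.foo_numbers n
  inner join schema.foo_category_properties cp on cp.category = n.category
  -- interpret the incoming wall-clock text as Los Angeles local time
  where n.created_at <= end_date::timestamp at time zone 'America/Los_Angeles'
    and cp.some_bool is false  -- same filter as in your original function
  order by n.category, n.created_at desc;
$$ language sql stable;
Then select * from schema.foo_latest('2020-01-07 19:00:00'); should return the two rows above.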