Search code examples
sqlpostgresqldategreatest-n-per-grouptimestamp-with-timezone

How can I return the most recent rows at or before $TIMESTAMP at a certain time zone, via a Postgres 11 function (stored proc)?


I have a Postgres 11 table like so :

CREATE TABLE schema.foo_numbers (

 id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
 created_at TIMESTAMP WITH TIME ZONE DEFAULT now(),
 quantity INTEGER,
 category TEXT
);

It has some data like :

 id |       created_at                  | quantity | category 
----+------------------------+----------+----------
  1 | 2020-01-01 12:00:00+00 |        2    | a
  2 | 2020-01-02 17:00:00+00 |        1    | b
  3 | 2020-01-01 15:00:00+00 |        6    | a
  4 | 2020-01-04 09:00:00+00 |        1    | b
  5 | 2020-01-05 19:00:00+00 |        2    | a
  6 | 2020-01-06 23:00:00+00 |        8    | b
  7 | 2020-01-07 20:00:00+00 |        1    | a
  8 | 2020-01-08 04:00:00+00 |        2    | b
  9 | 2020-01-09 23:00:00+00 |        1    | a
 10 | 2020-01-10 19:00:00+00 |        1   | b
 11 | 2020-01-11 05:00:00+00 |        1    | a
 12 | 2020-01-12 21:00:00+00 |        1   | b
 13 | 2020-01-13 01:00:00+00 |        1   | a
 14 | 2020-01-14 18:00:00+00 |        1   | b

I have another table which tracks certain properties of foo categories:

create table schema.foo_category_properties (
id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY, 
category TEXT NOT NULL,  
some_bool BOOLEAN NOT NULL DEFAULT FALSE
); 

This table has data like so:

 id | category | some_bool 
----+----------+-----------
  1 | a        | f
  2 | b        | f

I need to create a postgres function (to be called from app logic via the postgREST api) that will, for an argument $TIMESTAMP , return the most recent record for each category with created_at <= $TIMESTAMP .

Ideally, the incoming argument will be treated as TIMESTAMP WITH TIME ZONE AT TIME ZONE ‘America/Los_Angeles’ , and the function returns the latest record with its time stamp shown at that same time zone -- however, it’s also OK if that’s not possible and all timestamps remain in UTC [to be offset in app logic], provided the correct data is returned in a consistent fashion.

Server time is set to UTC:

psql => show time zone; 
 TimeZone 
----------
 UTC
(1 row)

The postgres function I’ve written is like so :

CREATE OR REPLACE FUNCTION schema.foo_proc (end_date TEXT)   

  RETURNS TABLE (
      id INTEGER,
      category TEXT,
      quantity BIGINT,
      snapshot_count NUMERIC,
      latest_entry TIMESTAMP WITH TIME ZONE 
  )
  AS $$
  #variable_conflict use_column

BEGIN
RETURN QUERY 

    SELECT 
    alias1.id,
    alias1.category, 
    alias1.quantity,
    alias1.snapshot_count,
    alias2.latest_entry AS latest_entry

    FROM
      (
          SELECT 
          id,
          category,
          quantity,
          sum(quantity) OVER (partition by category ORDER BY created_at) AS snapshot_count 
          FROM 
          schema.foo_numbers 
      ) AS alias1
    INNER JOIN 
    (
       SELECT
        max(id) AS id,
        category, 
        max(created_at AT TIME ZONE 'America/Los_Angeles') AS latest_entry
        from 
        schema.foo_numbers 
        WHERE created_at AT TIME ZONE 'America/Los_Angeles' <= to_timestamp($1', 'YYYY-MM-DD HH24:MI:SS') :: TIMESTAMPTZ AT TIME ZONE 'America/Los_Angeles'
        group by category
        order by category 
    ) AS alias2
    ON 
    alias1.id = alias2.id
    INNER JOIN 
    schema.foo_category_properties fcp 
    ON 
    alias2.category = fcp.category
    WHERE fcp.some_bool IS FALSE
    ORDER BY 
    alias1.category 
    ;
END;
$$ LANGUAGE plpgsql;

Here is the data in foo_numbers with timestamps shifted to time zone ‘America/Los_Angeles’

psql=> select id, created_at at time zone 'america/los_angeles', quantity, category  from schemai.foo_numbers order by created_at;

 id |      timezone               | quantity | category 
----+---------------------+----------+----------
  1 | 2020-01-01 04:00:00 |        2     | a
  3 | 2020-01-01 07:00:00 |        6     | a
  2 | 2020-01-02 09:00:00 |        1     | b
  4 | 2020-01-04 01:00:00 |        1     | b
  5 | 2020-01-05 11:00:00 |        2     | a
  6 | 2020-01-06 15:00:00 |        8     | b
  7 | 2020-01-07 12:00:00 |        1     | a
  8 | 2020-01-07 20:00:00 |        2     | b
  9 | 2020-01-09 15:00:00 |        1     | a
 10 | 2020-01-10 11:00:00 |        1    | b
 11 | 2020-01-10 21:00:00 |        1    | a
 12 | 2020-01-12 13:00:00 |        1    | b
 13 | 2020-01-12 17:00:00 |        1    | a
 14 | 2020-01-14 10:00:00 |        1    | b
(14 rows)

The expected output for argument: "end_date":"2020-01-07 19:00:00"

would be

 id | category | quantity | snapshot_count |      latest_entry      
----+----------+----------+----------------+------------------------
  6 | b        |        8 |             10 | 2020-01-06 15:00:00
  7 | a        |        1 |             11 | 2020-01-07 12:00:00
(2 rows)

However, actual output for the same argument is :

 id | category | quantity | snapshot_count |      latest_entry      
----+----------+----------+----------------+------------------------
  5 | a        |        2 |             10 | 2020-01-05 19:00:00+00
  6 | b        |        8 |             10 | 2020-01-06 23:00:00+00
(2 rows)

Similar unexpected results happen with the argument cast to timestamptz at UTC.

In all variations I’ve tried, the returned rows don’t correctly match the argument boundary.

Clearly, I’m failing to understand something about how time zones are handled in PG-- I have read the official docs in detail and a fair few related questions on SO, and also on the PG forum where the to_timestamp() function is discussed, but haven’t been able to get the correct results after considerable trial-and-error.

All guidance is much appreciated!


Solution

  • You can use distinct on and proper timezone translation:

    select distinct on (n.category)
        n.id,
        n.created_at at time zone 'America/Los_Angeles' at time zone 'utc' created_at,
        n.quantity,
        n.category,
        sum(quantity) 
            over (partition by n.category order by n.created_at) as snapshot_count 
    from foo_numbers n
    inner join foo_category_properties cp on cp.category = n.category
    where n.created_at <= '2020-01-07 19:00:00'::timestamp with time zone 
                              at time zone 'utc' at time zone 'America/Los_Angeles'
    order by n.category, n.created_at desc
    

    Demo on DB Fiddle:

    id | created_at             | quantity | category | snapshot_count
    -: | :--------------------- | -------: | :------- | -------------:
     7 | 2020-01-07 12:00:00+00 |        1 | a        |             11
     6 | 2020-01-06 15:00:00+00 |        8 | b        |             10