We've been collecting data for some time now, using timestamptz fields. That was a mistake; I should have used timestamp. We collect a lot of data from different locations and calculate UTC ourselves before pushing the data to Postgres. As I understand it, timestamp and timestamptz data are the same 8 bytes either way; what timestamptz gives you is magical (and invisible) AT TIME ZONE conversions. Meaning, the data is not different, it's how Postgres treats that data that's different. In our case, that means we're screwing things up by pushing the data into Postgres as UTC and then pulling it back out converted to local time. Our data doesn't come from a single time zone, which is why we convert everything to UTC internally, just as Postgres does.
To make reporting simpler, analytic tables typically carry a redundant pair of columns, local_dts and utc_dts. That way we can run reports that compare "Monday mornings from 8-11" across facilities in different time zones: different facilities have different time zones, so we use the "local" value (their local) for such queries. When we need a unified timeline, we use UTC instead. Put simply: rows in the same table may come from sources with different time zones.
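For example, the "Monday mornings" style of report ends up looking roughly like this (facility_id is illustrative here, not our exact schema):

SELECT facility_id,
       count(*) AS scans
  FROM analytic_scan
 WHERE extract(dow  FROM local_dts) = 1               -- Monday
   AND extract(hour FROM local_dts) BETWEEN 8 AND 10  -- 08:00-10:59, facility-local clock
 GROUP BY facility_id;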
Okay, that's the background. I've now got tens of millions of rows that I'm looking at updating. The structural modifications look simple:
-- Change the data type; this is instantaneous.
ALTER TABLE assembly
    ALTER COLUMN created_dts
    SET DATA TYPE timestamp;

-- Reset the default. It's probably not necessary, but otherwise the old ::timestamptz cast is misleading/confusing.
ALTER TABLE assembly
    ALTER COLUMN created_dts
    SET DEFAULT '-infinity'::timestamp;
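(Presumably I could also fold both changes into a single ALTER TABLE, something like:)

ALTER TABLE assembly
    ALTER COLUMN created_dts SET DATA TYPE timestamp,
    ALTER COLUMN created_dts SET DEFAULT '-infinity'::timestamp;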
I'll have to drop and recreate a few views, but that's just a matter of running some backup scripts.
My question is how to do the update effectively without dragging down the server. I'm imagining batching things 5K rows at a time, or the like. For the sake of simplicity, say that all of our servers are set to US/Central. When we pushed data in originally as UTC, it was converted by Postgres again, so the data is now off by the offset between our server's time zone and UTC. (I think.) If so, the simplest update might look like this:
SET TIME ZONE 'UTC'; -- Tell Postgres we're in UTC to line the data up with the UTC clock it's set to.

UPDATE analytic_scan
   SET created_dts = created_dts AT TIME ZONE 'US/Central'; -- Tell Postgres to convert the value back to where we started.
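As a sanity check before running that at scale, I've been eyeballing a few rows like this (purely an illustrative spot check, and it assumes the column has already been altered to timestamp):

SET TIME ZONE 'UTC';

SELECT id,
       created_dts                                         AS stored,
       (created_dts AT TIME ZONE 'US/Central')::timestamp  AS corrected
  FROM analytic_scan
 ORDER BY created_dts DESC
 LIMIT 10;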
The UPDATE itself seems to work (?), apart from the obvious omission of dealing with Daylight Saving Time. I could add a WHERE clause to handle that, but it doesn't change my question. And now the question itself. I've got record counts like these:
analytic_productivity 728,708
analytic_scan 4,296,273
analytic_sterilizer_load 136,926
analytic_sterilizer_loadinv 327,700
record_changes_log 17,949,132
So, not massive, but not nothing. Is there a way to sensibly slice the data in SQL so that I can process it in manageable batches?
All of the tables have a UUID id PK field, and a couple have a generated identity column, like this snippet from one of the reporting tables:
CREATE TABLE IF NOT EXISTS "data"."analytic_productivity" (
    "id" uuid NOT NULL DEFAULT NULL,
    "pg_con_id" integer GENERATED BY DEFAULT AS IDENTITY UNIQUE,
    "data_file_id" uuid NOT NULL DEFAULT NULL,
    "start_utc" timestamptz NOT NULL DEFAULT '-infinity',
    "start_local" timestamptz NOT NULL DEFAULT '-infinity',
    "end_utc" timestamptz NOT NULL DEFAULT '-infinity',
    "end_local" timestamptz NOT NULL DEFAULT '-infinity'
);
One idea I had is to use a substring or hash of the UUID::text to make smaller batches:

select * from analytic_sterilizer_loadinv
 where left(id::text, 1) = 'a';
That seems slow and horrible. A hash seems a bit better:

select abs(hashtext(id::text)) % 64 as bucket,
       count(*)
  from analytic_sterilizer_loadinv
 group by 1;
The bucket sizes aren't that even, but they might be good enough, and I can increase the number of buckets if needed. Unfortunately, I don't know the right way to loop over the buckets in SQL; the closest I've gotten is the DO block sketched below. If someone could point out how, I'd be grateful. And if there's a simple built-in chunking feature, I'd love to know about that.
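This is as far as I've gotten (using analytic_scan as the example again), though as far as I can tell the whole DO block runs in a single transaction, so it wouldn't commit per bucket; a procedure with COMMIT on Postgres 11+, or a client-side loop, seems to be needed for that:

SET TIME ZONE 'UTC';

DO $$
BEGIN
    FOR bucket IN 0..63 LOOP
        UPDATE analytic_scan
           SET created_dts = created_dts AT TIME ZONE 'US/Central'
         WHERE abs(hashtext(id::text)) % 64 = bucket;
        RAISE NOTICE 'bucket % done', bucket;
    END LOOP;
END;
$$;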
I have not thought through the obvious problem of how to deal with incoming data that gets caught up in the modifications, short of locking the whole table, which I might be able to do. If it comes to that, I assume the locking would look something like the sketch below.
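(Untested, and purely an assumption on my part:)

BEGIN;
-- EXCLUSIVE blocks concurrent writes but still allows reads.
LOCK TABLE analytic_scan IN EXCLUSIVE MODE;

UPDATE analytic_scan
   SET created_dts = created_dts AT TIME ZONE 'US/Central';

COMMIT;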
If you can afford it, don't do the UPDATE in batches, but all at once. The main disadvantage is that this will bloat the tables, so you should run VACUUM (FULL) on them afterwards, which will cause downtime.
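Concretely, that could be as simple as something like this (borrowing the correction and the analytic_scan example from the question):

SET TIME ZONE 'UTC';

UPDATE analytic_scan
   SET created_dts = created_dts AT TIME ZONE 'US/Central';

-- Rewrites the table to reclaim the dead row versions; takes an exclusive lock while it runs.
VACUUM (FULL) analytic_scan;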
I would write client code to do the update in batches, for example in bash:

typeset -i part=0

# PostgreSQL client time zone, so the AT TIME ZONE correction behaves the same as in the question
export PGTZ=UTC

# Walk the 64 hash buckets from the question, one UPDATE per bucket
while [ $part -lt 64 ]
do
    psql <<-EOF
UPDATE data.analytic_productivity
SET created_dts = created_dts at time zone 'US/Central'
WHERE abs(hashtext(id::text)) % 64 = '$part';
EOF
    psql -c "VACUUM data.analytic_productivity"
    part=part+1
done
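That way every bucket is committed in its own transaction, and the plain VACUUM between batches marks the dead row versions from the previous batch as reusable, so the table shouldn't grow much beyond its original size while the update runs.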