We've been collecting data for some time now, using timestamptz fields. That was a mistake; I should have used timestamp. We collect a lot of data from different locations and calculate UTC ourselves before pushing the data to Postgres. As I understand it, timestamp and timestamptz data are the same 8 bytes either way; what timestamptz gives you is magical (and invisible) AT TIME ZONE conversions. Meaning, the data is not different, it's how Postgres treats that data that's different. In our case, that means we're screwing things up by pushing the data into Postgres as UTC and then pulling it back out converted to local time. Our data doesn't come from a single time zone, which is why we convert everything to UTC internally, just as Postgres does.
To make reporting simpler, analytic tables typically carry a redundant pair of columns, local_dts and utc_dts. That way we can run reports that compare "Monday mornings from 8-11" across facilities in different time zones: different facilities have different time zones, so we use the "local" value (their local) for such queries. When we need a unified timeline, we use UTC instead. Put simply: rows in the same table may come from sources with different time zones.
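For example, the "Monday mornings" style of report ends up looking roughly like this (facility_id is illustrative here, not our exact schema):

SELECT facility_id,
       count(*) AS scans
  FROM analytic_scan
 WHERE extract(dow  FROM local_dts) = 1               -- Monday
   AND extract(hour FROM local_dts) BETWEEN 8 AND 10  -- 08:00-10:59, facility-local clock
 GROUP BY facility_id;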
Okay, that's the background. I've now got tens of millions of rows that I'm looking at updating. The structural modifications look simple:
-- Change the data type; this is instantaneous.
ALTER TABLE assembly
    ALTER COLUMN created_dts
    SET DATA TYPE timestamp;

-- Reset the default. It's probably not necessary, but otherwise the old ::timestamptz cast is misleading/confusing.
ALTER TABLE assembly
    ALTER COLUMN created_dts
    SET DEFAULT '-infinity'::timestamp;
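(Presumably I could also fold both changes into a single ALTER TABLE, something like:)

ALTER TABLE assembly
    ALTER COLUMN created_dts SET DATA TYPE timestamp,
    ALTER COLUMN created_dts SET DEFAULT '-infinity'::timestamp;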
I'll have to drop and recreate a few views, but that's just a matter of running some backup scripts.
My question is how to do the update effectively without dragging down the server. I'm imagining batching things 5K rows at a time, or the like. For the sake of simplicity, say that all of our servers are set to US/Central. When we pushed data in originally as UTC, it was converted by Postgres again, so the data is now off by the offset between our server's time zone and UTC. (I think.) If so, the simplest update might look like this:
SET TIME ZONE 'UTC'; -- Tell Postgres we're in UTC to line the data up with the UTC clock it's set to.

UPDATE analytic_scan
   SET created_dts = created_dts AT TIME ZONE 'US/Central'; -- Tell Postgres to convert the value back to where we started.
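As a sanity check before running that at scale, I've been eyeballing a few rows like this (purely an illustrative spot check, and it assumes the column has already been altered to timestamp):

SET TIME ZONE 'UTC';

SELECT id,
       created_dts                                         AS stored,
       (created_dts AT TIME ZONE 'US/Central')::timestamp  AS corrected
  FROM analytic_scan
 ORDER BY created_dts DESC
 LIMIT 10;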
The UPDATE itself seems to work (?), apart from the obvious omission of dealing with Daylight Saving Time. I could add a WHERE clause to handle that, but it doesn't change my question. And now the question itself. I've got record counts like these:
analytic_productivity 728,708
analytic_scan 4,296,273
analytic_sterilizer_load 136,926
analytic_sterilizer_loadinv 327,700
record_changes_log 17,949,132
So, not massive, but not nothing. Is there a way to sensibly slice the data in SQL so that I can process it in manageable batches?
All of the tables have a UUID id PK field, and a couple have a generated identity column, like this snippet from one of the reporting tables:
CREATE TABLE IF NOT EXISTS "data"."analytic_productivity" (
    "id" uuid NOT NULL DEFAULT NULL,
    "pg_con_id" integer GENERATED BY DEFAULT AS IDENTITY UNIQUE,
    "data_file_id" uuid NOT NULL DEFAULT NULL,
    "start_utc" timestamptz NOT NULL DEFAULT '-infinity',
    "start_local" timestamptz NOT NULL DEFAULT '-infinity',
    "end_utc" timestamptz NOT NULL DEFAULT '-infinity',
    "end_local" timestamptz NOT NULL DEFAULT '-infinity'
);
One idea I had is to use a substring or hash of the UUID::text to make smaller batches:

select * from analytic_sterilizer_loadinv
 where left(id::text, 1) = 'a';
That seems slow and horrible. A hash seems a bit better:

select abs(hashtext(id::text)) % 64 as bucket,
       count(*)
  from analytic_sterilizer_loadinv
 group by 1;
The bucket sizes aren't that even, but they might be good enough, and I can increase the number of buckets if needed. Unfortunately, I don't know the right way to loop over the buckets in SQL; the closest I've gotten is the DO block sketched below. If someone could point out how, I'd be grateful. And if there's a simple built-in chunking feature, I'd love to know about that.
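This is as far as I've gotten (using analytic_scan as the example again), though as far as I can tell the whole DO block runs in a single transaction, so it wouldn't commit per bucket; a procedure with COMMIT on Postgres 11+, or a client-side loop, seems to be needed for that:

SET TIME ZONE 'UTC';

DO $$
BEGIN
    FOR bucket IN 0..63 LOOP
        UPDATE analytic_scan
           SET created_dts = created_dts AT TIME ZONE 'US/Central'
         WHERE abs(hashtext(id::text)) % 64 = bucket;
        RAISE NOTICE 'bucket % done', bucket;
    END LOOP;
END;
$$;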
I have not thought through the obvious problem of how to deal with incoming data that gets caught up in the modifications, short of locking the whole table, which I might be able to do. If it comes to that, I assume the locking would look something like the sketch below.
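(Untested, and purely an assumption on my part:)

BEGIN;
-- EXCLUSIVE blocks concurrent writes but still allows reads.
LOCK TABLE analytic_scan IN EXCLUSIVE MODE;

UPDATE analytic_scan
   SET created_dts = created_dts AT TIME ZONE 'US/Central';

COMMIT;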
If you can afford it, don't do the UPDATE in batches, but all at once. The main disadvantage is that this will bloat the tables, so you should run VACUUM (FULL) on them afterwards, which will cause downtime.
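Concretely, that could be as simple as something like this (borrowing the correction and the analytic_scan example from the question):

SET TIME ZONE 'UTC';

UPDATE analytic_scan
   SET created_dts = created_dts AT TIME ZONE 'US/Central';

-- Rewrites the table to reclaim the dead row versions; takes an exclusive lock while it runs.
VACUUM (FULL) analytic_scan;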
I would write client code to do the update in batches, for example in bash:

typeset -i part=0

# PostgreSQL client time zone, so the AT TIME ZONE correction behaves the same as in the question
export PGTZ=UTC

# Walk the 64 hash buckets from the question, one UPDATE per bucket
while [ $part -lt 64 ]
do
    psql <<-EOF
UPDATE data.analytic_productivity
SET created_dts = created_dts at time zone 'US/Central'
WHERE abs(hashtext(id::text)) % 64 = '$part';
EOF
    psql -c "VACUUM data.analytic_productivity"
    part=part+1
done
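That way every bucket is committed in its own transaction, and the plain VACUUM between batches marks the dead row versions from the previous batch as reusable, so the table shouldn't grow much beyond its original size while the update runs.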