I need to update a column in a PostGIS database.
I want to do that with this:
UPDATE
    cj_geometry
SET
    groundgeometry = CASE
        WHEN NOT ST_IsValid(groundgeometry) THEN ST_Multi(
            ST_CollectionExtract(
                ST_MakeValid(groundgeometry),
                3
            )
        )
        ELSE ST_MakeValid(groundgeometry)
    END
It's around 65,000,000 rows.
I let it run for an hour or so, then realized there was no space left on the device, so I quit the query and restarted the database. It did restart, but the database is now much bigger on disk, without any new data. So two questions:
Postgres does not release storage back to the OS by default. Databases tend only to grow, so there is not much call to release storage back. Postgres keeps this storage around so that future inserts and updates can simply reuse the space occupied by dead tuples. This is faster than asking the OS to extend the size of the file.
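You can confirm this by checking how much disk the table is currently using, indexes and TOAST data (where large geometry values usually live) included; cj_geometry is the table from the question:
-- total on-disk size of the table, including its TOAST data and indexes
SELECT pg_size_pretty(pg_total_relation_size('cj_geometry'));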
To reclaim assigned storage you can do a full vacuum. This essentially transfers the table to a new file, compacting the backing-file to its smallest possible size. This requires that you have enough space to store the old backing-file and the new backing-file simultaneously. So if you are very low on disk space, then this may not be an option.
VACUUM (FULL) mytable;
If you cannot do a full vacuum because of limited disk space, then your only options are to dump and reload the table, or to transfer the table to and from a tablespace on a different disk (covered in more detail below). Of the two, dump and reload is perhaps the simpler option. You need to dump/export the table to a different host or partition, truncate/drop the table, and then load/import the table again. This may require you to (temporarily) drop any foreign key constraints that point to your table. As such, to keep database integrity, you must stop and prevent any other users from accessing the database whilst you are performing this process. Once you have restored all the dropped constraints you can re-enable access to the database.
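As a rough sketch of the dump-and-reload route done entirely from SQL, assuming the spare partition is mounted on the database host at /mnt/recovery (a hypothetical path) and that you have superuser rights or the pg_read_server_files/pg_write_server_files roles:
-- export the table's data to a file on the spare partition
COPY mytable TO '/mnt/recovery/mytable.copy';
-- truncate immediately returns the table's disk space to the OS;
-- foreign keys referencing mytable must be dropped (or CASCADE used) first
TRUNCATE mytable;
-- reload the data from the exported file
COPY mytable FROM '/mnt/recovery/mytable.copy';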
If you do not have enough space on the current disk that hosts your postgres data directory, but you do have a separate disk on your host that does have enough space, then you can use it to trim the table without the data ever leaving the postgres server. This approach has the benefit of postgres being able to maintain its ACID guarantees without any manual assistance from you. To do this you need to:
1. Mount the disk and create a directory on the mount point that is owned by the postgres user. Then make the directory available as a tablespace in postgres:
CREATE TABLESPACE recoveryspace LOCATION '/path/to/mount/postgresql/data';
2. ALTER TABLE to use this new tablespace. This will transfer the table to your other disk, compacting it as it goes. This is a big operation, and access to the table will be blocked whilst this operation is ongoing.
-- send table to disk with free space.
-- this compacts the table, and releases the space on your primary disk
ALTER TABLE mytable SET TABLESPACE recoveryspace;
-- send the table back to your primary tablespace/disk
ALTER TABLE mytable SET TABLESPACE pg_default;
Once you have transferred the table back to its original tablespace, you can drop the tablespace and then unmount the disk.
DROP TABLESPACE recoveryspace;
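DROP TABLESPACE will refuse to run if anything is still stored in it. If you want to see what (if anything) remains, you can query the catalog; indexes moved to the tablespace would show up in pg_indexes instead:
-- should return no rows once everything is back in pg_default
SELECT schemaname, tablename FROM pg_tables WHERE tablespace = 'recoveryspace';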
To avoid this problem in future you can do incremental updates. This reduces the total number of reachable tuples at any given time, and so reduces the maximum size of the backing-file. That is, when you update every row in a table in a single transaction, postgres must keep both the before and after version of every row until that transaction commits. This means that the backing-file has to grow to roughly twice its normal size to accommodate all these rows.
If you only update 1 million rows or so at a time, then the total size of the backing-file only needs to be large enough to accommodate 66 million tuples (65m + 1m). When you update the next batch, there will be 1 million tuples that are no longer reachable, and postgres can safely reuse this space. You will likely have to tell postgres that it should recycle these unreachable tuples, though; that is what the VACUUM in the example below does. For this technique to work, there must be no long-running transactions that keep the old rows reachable.
BEGIN;
UPDATE mytable SET mycol = newval WHERE <criteria>;
COMMIT;
VACUUM; -- recycles unreachable tuples, but retains file storage
The <criteria> should be able to exclude previously updated rows. A simple criterion is to use a unique column or primary key on the table, e.g.
UPDATE mytable
SET mycol = f(x)
WHERE id in (
    SELECT id FROM mytable WHERE 0 <= id AND id < 1000000
);
-- and then on the next update
WHERE 1000000 <= id AND id < 2000000
You can run this update in a loop (committing and vacuuming between each update), changing the range of ids each time, until all rows have been successfully updated.
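As a minimal sketch of that loop, run from psql with autocommit on (the default) so that each UPDATE commits before the VACUUM that follows it; the id ranges are illustrative:
UPDATE mytable SET mycol = f(x) WHERE 0 <= id AND id < 1000000;
VACUUM mytable;
UPDATE mytable SET mycol = f(x) WHERE 1000000 <= id AND id < 2000000;
VACUUM mytable;
-- ...continue in one-million-id steps until the maximum id is reached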
This requires you to keep track of which ids have been updated between batches. If you can tell from the row itself whether it still needs updating, then you do not need to keep track of the id column between queries; instead, use that check in the where clause when selecting which ids to update.
UPDATE mytable
SET mycol = f(x)
WHERE id in (
    SELECT id FROM mytable
    -- exclude rows that have been successfully updated
    WHERE is_invalid(mycol)
    LIMIT 1000000
);
If there is no unique identifier on the table then you can use the system column ctid to help you limit which rows are selected for update. This requires that you can tell whether a row still needs updating by inspecting it. If you cannot, then simply add a temporary boolean column (with no default) and set it to true as rows are updated. At the end of the updates you can then drop the column. This will lead to some bloat in the table, but it will disappear over time as the rows get updated (dropped columns are physically removed only when a row is rewritten).
UPDATE
    mytable
SET
    mycol = f(x)
WHERE
    ctid IN (
        SELECT ctid FROM mytable
        WHERE is_invalid(mycol)
        LIMIT 1000000
    );
ALTER TABLE mytable ADD COLUMN updated boolean;
BEGIN;
UPDATE
    mytable
SET
    mycol = f(x), updated = true
WHERE
    ctid IN (
        SELECT ctid FROM mytable
        WHERE updated IS NULL
        LIMIT 1000000
    );
COMMIT;
VACUUM;
-- repeat update, commit and vacuum until done
-- clean up table
ALTER TABLE mytable DROP COLUMN updated;