
Why does updating a table in PostgreSQL need disk storage?


I need to update a column in a PostGIS database.

I want to do that with this:

UPDATE 
  cj_geometry 
SET 
  groundgeometry = CASE WHEN NOT ST_IsValid(groundgeometry) THEN ST_Multi(
    ST_CollectionExtract(
      ST_MakeValid(groundgeometry), 
      3
    )
  ) ELSE ST_MakeValid(groundgeometry) END;

The table has around 65,000,000 rows.

I let it run for an hour or so, then realized there was no space left on the device, so I cancelled the query and restarted the database. It did restart, but the database is now much larger on disk, without any new data. So, two questions:

  1. How do I get it back to its original size?
  2. Why does this use so much disk space? How can I avoid this?

Solution

  • Postgres does not release storage back to the OS by default. Under Postgres' MVCC design an UPDATE does not overwrite rows in place: it writes a new version of each row and leaves the old version behind as a dead tuple. Databases tend only to grow, so there's not much call to release storage back to the OS. Instead, Postgres keeps this storage around so that future inserts and updates can simply reuse the space occupied by dead tuples. This is faster than asking the OS to extend the file.
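
    You can see how big the table is, and how many of its tuples are currently dead, via the statistics views. A small sketch, using the table name from the question (the counts are estimates maintained by autovacuum/ANALYZE):

    SELECT pg_size_pretty(pg_total_relation_size('cj_geometry')) AS total_size,
           n_live_tup,
           n_dead_tup
    FROM pg_stat_user_tables
    WHERE relname = 'cj_geometry';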

    Reclaiming Disk Space

    To reclaim assigned storage you can do a full vacuum. This rewrites the table into a new file, compacting the backing file to its smallest possible size. It requires enough free space to hold the old and the new backing file simultaneously (and it takes an exclusive lock on the table while it runs). So if you are very low on disk space, this may not be an option.

    VACUUM (FULL) mytable;
    

    If you cannot do a full vacuum because of limited disk space, then your only options are to dump and reload the table, or to transfer the table to and from a tablespace on a different disk (covered in more detail below). Of the two, dump and reload is perhaps the simpler option: you dump/export the table to a different host or partition, truncate/drop the table, and then load/import it again. This may require you to (temporarily) drop any foreign key constraints that point to your table, so to keep database integrity you must stop and prevent any other users from accessing the database whilst you are performing this process. Once you have restored all the dropped constraints you can re-enable access to the database.
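
    A minimal sketch of the dump-and-reload route using server-side COPY. It assumes a spare partition mounted at /mnt/spare and a role allowed to read and write files on the server (a superuser, or membership of pg_read_server_files/pg_write_server_files); both the path and mytable are placeholders:

    -- export the rows to a file on the partition with free space
    COPY mytable TO '/mnt/spare/mytable.copy';
    -- drop the bloated backing file
    TRUNCATE mytable;
    -- reload the rows into a fresh, compact backing file
    COPY mytable FROM '/mnt/spare/mytable.copy';

    Alternatively, pg_dump and pg_restore from another host achieve the same thing without needing file access on the database server.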

    Tablespaces

    If you do not have enough space on the current disk that hosts your postgres data directory, but you do have a separate disk on your host that does have enough space, then you can use it to trim the table without the data ever leaving the postgres server. This approach has the benefit of postgres being able to maintain its ACID guarantees without any manual assistance from you. To do this you need to:

    1. Mount the disk
    2. Initialise a data directory on the mount point (owned by the postgres user)
    3. Create a tablespace that points to this directory
    4. ALTER TABLE to use this new tablespace. This moves the table to your other disk and releases its space on the primary disk (the move is a block-for-block copy, so it does not compact the table by itself)
    5. VACUUM (FULL) the table while it sits on the spare disk, rewriting it at its smallest size
    6. Transfer the compacted table back to the original tablespace
    7. Drop the tablespace, and unmount the disk.

    Creating the Tablespace

    Mount the disk and create a directory on the mount point that is owned by postgres. Then make the directory available as a tablespace in postgres:

    CREATE TABLESPACE recoveryspace LOCATION '/path/to/mount/postgresql/data';
    

    Compacting the Table

    These are big operations, and access to the table will be blocked whilst they are ongoing.

    -- move the table to the disk with free space;
    -- this releases the table's space on your primary disk
    ALTER TABLE mytable SET TABLESPACE recoveryspace;
    -- SET TABLESPACE copies the backing file as-is, so compact it
    -- with a full vacuum while there is room for old and new copies
    VACUUM (FULL) mytable;
    -- send the compacted table back to your primary tablespace/disk
    ALTER TABLE mytable SET TABLESPACE pg_default;
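
    You can check which tablespace the table currently lives in, and how big it is, between each step. A small sketch:

    -- tablespace is NULL while the table is in the default tablespace
    SELECT tablespace, pg_size_pretty(pg_table_size('mytable')) AS size
    FROM pg_tables
    WHERE tablename = 'mytable';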
    

    Delete the Tablespace

    Once you have transferred the table back to its original tablespace, you can drop the tablespace and then unmount the disk.

    DROP TABLESPACE recoveryspace;
    

    How to Avoid Using Too Much Disk Space

    To avoid this problem in future you can do incremental updates. This reduces the total number of reachable tuples at any given time, and so reduces the maximum size of the backing file. That is, when you update every row in a single transaction, postgres must keep both the old and the new version of every row until the transaction commits. This means the backing file has to grow to roughly twice its normal size to accommodate both versions.
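
    This also gives you a rough way to predict the peak disk usage of a whole-table rewrite before running it; the factor of two stands for the before and after row versions described above:

    -- rough upper bound on the table's size during a full-table update
    SELECT pg_size_pretty(2 * pg_table_size('cj_geometry'));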

    If you only update 1 million rows or so at a time, then the backing file only needs to be large enough to accommodate 66 million tuples (65m live + 1m updated). When you update the next batch, there will be 1 million tuples that are no longer reachable, and postgres can safely reuse that space. You will likely have to tell postgres to recycle these unreachable tuples, though, by running VACUUM between batches (VACUUM cannot run inside a transaction block, which is why it comes after the COMMIT below). For this technique to work there must be no long-running transactions that keep the old row versions reachable.

    BEGIN;
    UPDATE mytable SET mycol = newval WHERE <criteria>;
    COMMIT;
    VACUUM;  -- recycles unreachable tuples, but retains file storage
    

    <criteria> should exclude previously updated rows. A simple criterion is a unique column or primary key on the table, e.g.

    UPDATE mytable
        SET mycol = f(x)
        WHERE id IN (
            SELECT id FROM mytable WHERE 0 <= id AND id < 1000000
        );
    -- and then on the next update
    -- WHERE 1000000 <= id AND id < 2000000
    

    You can run this update in a loop (committing and vacuuming between each update), changing the ids each time, until all rows have been successfully updated.

    This requires you to keep track of which ids have been updated. If you can inspect a row and tell whether it still needs updating, then you do not need to track id ranges between queries; instead, use that check in the WHERE clause when selecting the rows to update.

    UPDATE mytable
        SET mycol = f(x)
        WHERE id IN (
            SELECT id FROM mytable
            -- exclude rows that have already been updated;
            -- is_invalid() is a placeholder for your check,
            -- e.g. NOT ST_IsValid(groundgeometry) from the question
            WHERE is_invalid(mycol)
            LIMIT 1000000
        );
    
    

    If there is no unique identifier on the table then you can use the system column ctid to limit which rows are selected for update. Note that ctids are physical row locations that change when a row is updated, so re-select them in each batch rather than saving them. This approach still requires you to be able to tell whether a row needs updating. If you cannot, then simply add a temporary boolean column (with no default, so adding it is a cheap metadata-only change) and set it to true as rows are updated; at the end you can drop the column. This will lead to some bloat in the table, but it will disappear over time as the rows are updated.

    UPDATE 
        mytable
    SET 
        mycol = f(x)
    WHERE
        ctid IN (
            SELECT ctid FROM mytable
            WHERE is_invalid(mycol)
            LIMIT 1000000
        );
    

    Using a temporary column

    ALTER TABLE mytable ADD COLUMN updated boolean;
    
    BEGIN;
    UPDATE 
        mytable
    SET 
        mycol = f(x), updated = true
    WHERE
        ctid IN (
            SELECT ctid FROM mytable
            WHERE updated IS NULL
            LIMIT 1000000
        );
    COMMIT;
    VACUUM;
    -- repeat update, commit and vacuum until done
    
    -- clean up table
    ALTER TABLE mytable DROP COLUMN updated;
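
    Putting this together for the table from the question: one batch of the incremental update might look like the sketch below. It uses the ctid approach since no unique column is mentioned, and because each batch only selects rows that fail ST_IsValid, the CASE expression from the original query collapses to its invalid branch:

    BEGIN;
    UPDATE cj_geometry
    SET groundgeometry = ST_Multi(
            ST_CollectionExtract(ST_MakeValid(groundgeometry), 3)
        )
    WHERE ctid IN (
        -- only rows that still fail the validity check; re-evaluated per batch
        SELECT ctid FROM cj_geometry
        WHERE NOT ST_IsValid(groundgeometry)
        LIMIT 1000000
    );
    COMMIT;
    VACUUM cj_geometry;  -- recycle the old row versions before the next batch
    -- repeat until the UPDATE reports 0 rows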