Tags: postgresql, sql-update, mvcc

Repetitive Postgres updates of arrays leading to bloat?


I am running a Python script which processes time-series data for a number of different metrics, and then writes the results to a Postgres database.

The time series spans 40 epochs, stored in real[40] array columns in the database.

When writing the outputs for all 40 epochs to the table in one shot (a batch update across all rows), everything seemed to work fine, i.e.

UPDATE my_table SET
  arr_col_1 = '{1, 2, 3, ... 40}',
  arr_col_2 = '{1, 2, 3, ... 40}',
  ...
  arr_col_90 = '{1, 2, 3, ... 40}';

However, iteratively writing the results of the respective epochs to each position in the array seems to chew up all free space on the hard drive, e.g.

UPDATE my_table SET
  arr_col_1[1] = 1,
  arr_col_2[1] = 1,
  ...
  arr_col_90[1] = 1;

UPDATE my_table SET
  arr_col_1[2] = 2,
  arr_col_2[2] = 2,
  ...
  arr_col_90[2] = 2;

-- repeat x 38 more times

The reason for the iterative strategy is to accommodate larger numbers of rows, where the results for all 40 epochs don't fit into memory at once.

To my knowledge, UPDATE queries will delete and rewrite row data in certain situations, but I'm not clear on when this happens and how this possibly relates to arrays. Is there a way to iteratively update arrays across large numbers of rows without leading to database bloat?


Solution

  • As others have correctly mentioned, this approach is not well suited to PostgreSQL's mode of operation.

    However, you may be able to use an optimization called HOT (Heap-Only Tuple) updates:

    • Declare your table with a fillfactor less than 100 so that INSERTs leave free space in each block:

      ALTER TABLE my_table SET (fillfactor = 50);
      

      This setting only affects future activity; to apply it to existing data, you would have to rewrite the table (for example, with VACUUM FULL). If you update every row in the table, you may need a setting as low as 30 for it to be effective.
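      As a sketch (the schema here is illustrative, not the asker's actual table), the fillfactor can be declared up front on a new table, or set on an existing one and followed by a rewrite so existing rows are repacked with the new free space:

      -- declare fillfactor at creation time (illustrative schema)
      CREATE TABLE my_table (
        id        bigint PRIMARY KEY,
        arr_col_1 real[],
        arr_col_2 real[]
        -- ... further array columns ...
      ) WITH (fillfactor = 50);

      -- or apply it to an existing table, then rewrite the table
      -- so the free space is left in each existing block
      ALTER TABLE my_table SET (fillfactor = 50);
      VACUUM FULL my_table;

      Note that VACUUM FULL takes an exclusive lock and rewrites the whole table, so it is best done during a maintenance window.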

    • Make sure the columns that are updated do not have an index on them.
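    One way to verify this is to list the indexes on the table and check that none of their definitions covers the updated array columns:

      -- list all indexes on my_table with their definitions
      SELECT indexname, indexdef
      FROM pg_indexes
      WHERE tablename = 'my_table';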

    Then PostgreSQL can use "HOT updates" and reclaim the dead tuples on the fly, without relying on autovacuum, which evidently cannot keep up on your table.

    Check the n_tup_hot_upd column in the pg_stat_user_tables row for your table to see if it is working.
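    For example, the fraction of updates that took the HOT path can be queried like this (using the table name from the question):

      -- compare HOT updates to total updates for my_table
      SELECT n_tup_upd,
             n_tup_hot_upd,
             round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1)
               AS hot_update_pct
      FROM pg_stat_user_tables
      WHERE relname = 'my_table';

    A percentage close to 100 means the fillfactor trick is working; a low percentage suggests the blocks have no free space left or an index on an updated column is blocking HOT.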