
How can I reduce the storage space required by large indexes and large PostgreSQL tables in order to improve and ensure performance?


The Situation (updated):

I have 3 PostgreSQL tables (PostgreSQL version 14.9) that are used to store sensor data from currently 2670 data points. The data was written to the database in blocks covering 3 days each, using automatically generated INSERT statements of the form:

INSERT INTO monitoring_data.time_series_numerical (timestamp, property_id, value) 
VALUES ('timestamp', 'property_id', 'value')
...
ON CONFLICT (timestamp, property_id) DO NOTHING;

This ultimately led to the database becoming slower and slower, to the point that additional storage space had to be freed up before it could be accessed again. The disk space usage of the 3 tables:

table_name                 number_of_rows  bytes/row (total)  bytes/row (heap)  total    index    table    toast
time_series                   550,849,340  226.1              97.2              116 GB   66 GB    50 GB    -
time_series_numerical         501,039,740  242.7              108.9             113 GB   62 GB    51 GB    -
time_series_alphabetical       32,561,400  255.9              110.5             7946 MB  4515 MB  3431 MB  8192 bytes

(All three tables are in schema monitoring_data.)

The 3 tables via CREATE script from pgAdmin 4:

CREATE TABLE IF NOT EXISTS monitoring_data.time_series
(
    "timestamp" timestamp without time zone NOT NULL,
    property_id character varying(255) COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT time_series_pkey PRIMARY KEY ("timestamp", property_id),
    CONSTRAINT time_series_property_id_fkey FOREIGN KEY (property_id)
        REFERENCES monitoring_data.properties (property_id) MATCH SIMPLE
        ON UPDATE CASCADE
        ON DELETE RESTRICT
)
  • property_id is a varchar(255), e.g. c0022:93EA6CCA3B7D1829_AnalogInput1.status.voltage, which is used to identify the data point

time_series_numerical:

CREATE TABLE IF NOT EXISTS monitoring_data.time_series_numerical
(
    -- Inherited from table monitoring_data.time_series: "timestamp" timestamp without time zone NOT NULL,
    -- Inherited from table monitoring_data.time_series: property_id character varying(255) COLLATE pg_catalog."default" NOT NULL,
    value double precision NOT NULL,
    CONSTRAINT time_series_numerical_pkey PRIMARY KEY ("timestamp", property_id),
    CONSTRAINT time_series_numerical_timestamp_property_id_fkey FOREIGN KEY ("timestamp", property_id)
        REFERENCES monitoring_data.time_series ("timestamp", property_id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
)
    INHERITS (monitoring_data.time_series)

time_series_alphabetical: same as time_series_numerical, except for the value column:

value text COLLATE pg_catalog."default" NOT NULL,

To ensure that the foreign key constraint is satisfied when writing records to one of the inheriting tables, a trigger was created, together with a function that executes the following body before the INSERT of each record:

BEGIN
    INSERT INTO monitoring_data.time_series (timestamp, property_id)
    SELECT NEW.timestamp, NEW.property_id
    WHERE NOT EXISTS (
        SELECT 1 FROM monitoring_data.time_series 
        WHERE timestamp = NEW.timestamp
        AND property_id = NEW.property_id 
    );
    RETURN NEW;
END;
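For context, a complete wiring of this trigger might look like the following; only the body above comes from the original setup, while the function and trigger names here are illustrative:

```sql
-- Hypothetical wrapper for the function body quoted above;
-- function and trigger names are illustrative, not from the post.
CREATE FUNCTION monitoring_data.ensure_time_series_row()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
    INSERT INTO monitoring_data.time_series (timestamp, property_id)
    SELECT NEW.timestamp, NEW.property_id
    WHERE NOT EXISTS (
        SELECT 1 FROM monitoring_data.time_series
        WHERE timestamp = NEW.timestamp
        AND property_id = NEW.property_id
    );
    RETURN NEW;
END;
$$;

CREATE TRIGGER time_series_numerical_parent_row
    BEFORE INSERT ON monitoring_data.time_series_numerical
    FOR EACH ROW
    EXECUTE FUNCTION monitoring_data.ensure_time_series_row();
```

Note that the NOT EXISTS check is not safe under concurrent inserts; with the primary key on ("timestamp", property_id) in place, `INSERT ... ON CONFLICT (timestamp, property_id) DO NOTHING` would achieve the same effect atomically.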

The time_series table should be used to simplify access to measured values in the underlying tables, regardless of the data type, and to enable all timestamp and property_id entries to be stored together and only once.

As I have already noted, this is not working as expected: the data is stored twice (once in the time_series table and once in the table for the respective measurement type), and within the time_series table itself each entry even exists twice.

Query example:

SELECT *
FROM monitoring_data.time_series_numerical
WHERE property_id = 'c0022:9FF441820C58912B_HeatPump.configuration.flowTemperature'
AND timestamp >= '2024-09-01 00:00:00'
AND timestamp < '2024-09-02 00:00:00'
ORDER BY timestamp;

Result with EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS):

Index Scan using time_series_numerical_pkey on monitoring_data.time_series_numerical  (cost=0.57..1153041.19 rows=4516 width=72) (actual time=0.516..633099.399 rows=5725 loops=1)
  Output: "timestamp", property_id, value
  Index Cond: ((time_series_numerical."timestamp" >= '2024-09-01 00:00:00'::timestamp without time zone) AND (time_series_numerical."timestamp" < '2024-09-02 00:00:00'::timestamp without time zone) AND ((time_series_numerical.property_id)::text = 'c0022:9FF441820C58912B_HeatPump.configuration.flowTemperature'::text))
  Buffers: shared read=199057
Settings: search_path = 'monitoring_data'
Planning Time: 0.485 ms
JIT:
  Functions: 2
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 2.138 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 2.138 ms
Execution Time: 633108.973 ms

Before I begin to implement a solution for this problem, I am seeking advice here, as I am just starting to work with PostgreSQL.

Thoughts on solutions and questions:

I currently have two ideas to tackle the problem, but I would like to know what other options I have and what would make sense or be considered best practice for my use case.

  1. Delete all entries from the time_series table and remove the foreign key constraints from both inheriting tables, along with the triggers and functions. Access to the inheriting tables via the time_series table would still be possible, but significantly less storage space would be used.

  2. Use partitioning to split the data across multiple tables based on the timestamp in order to speed up queries.
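A minimal sketch of option 2 using declarative range partitioning, with monthly partitions on the timestamp (table and partition names are illustrative; partitions would need to cover the full data range):

```sql
-- Partitioned variant of time_series_numerical (illustrative names);
-- the partition key must be part of the primary key, which it is here.
CREATE TABLE monitoring_data.time_series_numerical_part (
    "timestamp" timestamp without time zone NOT NULL,
    property_id character varying(255) NOT NULL,
    value double precision NOT NULL,
    PRIMARY KEY ("timestamp", property_id)
) PARTITION BY RANGE ("timestamp");

-- One partition per month, created as data arrives:
CREATE TABLE monitoring_data.time_series_numerical_2024_09
    PARTITION OF monitoring_data.time_series_numerical_part
    FOR VALUES FROM ('2024-09-01') TO ('2024-10-01');
```

Partitioning mainly helps queries that filter on the partition key and makes bulk deletion of old data cheap (dropping a partition); it does not by itself shrink the per-row storage.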

It is still unclear to me why the indexes take up so much space and how I can reduce that.
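For a rough sense of the scale, dividing the reported index size of time_series by its row count (both figures from the statistics above) gives the average index footprint per row. The composite key itself (8-byte timestamp plus a ~62-byte property_id) accounts for most of it, with per-entry index overhead, alignment, and partially filled pages plausibly making up the rest:

```sql
-- Average index bytes per row of time_series, from the reported sizes:
SELECT 71006248960 / 550849340.0 AS index_bytes_per_row;  -- roughly 129 bytes
```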

Additional info:

  • PostgreSQL version 14.9
  • There are no additional indexes besides the primary keys; all indexes are B-tree indexes
  • There are no dead tuples; autovacuum is enabled
  • Extensions: postgis, fuzzystrmatch, uuid-ossp
  • It was decided that time_series_numerical and time_series_alphabetical should be separate tables


Solution

  • The OP's original design includes the full property name (called property_id in the post) with each sample. The example name is 62 characters long, which seems to be a reasonable estimate of the mean name length given the table statistics that were provided. Repeating the name with each sample is an extremely inefficient use of space.

    The following demonstrates an approach that reduces the required space by replacing the property name in each sample with a reference to a properties table. This approach results in an order of magnitude improvement in storage efficiency. Some column names have been changed to avoid using PostgreSQL keywords.

    Below are the table definitions (time_series_alphabetical is omitted since it is nearly identical to time_series_numerical):

    CREATE TABLE properties (
      id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
      updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
      property_name TEXT NOT NULL UNIQUE,
      CONSTRAINT properties_updated_at_ck CHECK (created_at <= updated_at)
    );
    
    CREATE TABLE time_series_numerical (
      sample_timestamp TIMESTAMPTZ,
      property_id BIGINT REFERENCES properties (id),
      created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
      updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
      sample_value DOUBLE PRECISION,
      CONSTRAINT time_series_numerical_pk PRIMARY KEY (property_id, sample_timestamp),
      CONSTRAINT time_series_numerical_updated_at_ck CHECK (created_at <= updated_at)
    );
    

    The audit columns (created_at and updated_at) can be omitted if there isn't a need or desire to track those attributes. Timestamps are declared as TIMESTAMPTZ as recommended at Date/Time storage. The primary key for time_series_numerical is (property_id, sample_timestamp) to facilitate efficient retrieval by property. If retrieval by sample_timestamp is also common, then an additional index should be added with sample_timestamp as the first attribute.
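If such timestamp-first retrieval is needed, the additional index could look like this (the index name is illustrative):

```sql
-- Secondary index for queries that filter primarily by time:
CREATE INDEX time_series_numerical_ts_idx
  ON time_series_numerical (sample_timestamp, property_id);
```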

    The following uses a single SQL statement to insert a set of samples with the ability to automatically add properties that have not previously been recorded:

    WITH
      samples AS (
        SELECT
          ts::TIMESTAMPTZ AS sample_timestamp,
          property_name,
          sample_value
        FROM
          (
            VALUES
              ('2024-01-01 00:00:00', 'prop1', 1),
              ('2024-01-01 01:00:00', 'prop1', 2),
              ('2024-01-01 00:00:00', 'prop2', 3),
              ('2024-01-01 00:00:00', 'prop4', 4),
              ('2024-01-01 00:00:00', 'prop3', 5)
          ) v (ts, property_name, sample_value)
      ),
      new_properties AS (
        INSERT INTO properties (property_name)
        SELECT DISTINCT property_name
          FROM samples
          ON CONFLICT DO NOTHING
          RETURNING id, property_name
      )
    INSERT INTO time_series_numerical AS tsn (sample_timestamp, property_id, sample_value)
    SELECT s.sample_timestamp,
           COALESCE(p.id,
                    (SELECT np.id
                       FROM new_properties np
                       WHERE np.property_name = s.property_name)) AS property_id,
           s.sample_value
      FROM samples s
      LEFT JOIN properties p ON p.property_name = s.property_name
      ON CONFLICT (sample_timestamp, property_id)
        DO UPDATE SET sample_value = EXCLUDED.sample_value,
                      updated_at = CURRENT_TIMESTAMP
             WHERE tsn.sample_value IS DISTINCT FROM EXCLUDED.sample_value;
    

    The ON CONFLICT clause updates a previously recorded sample if the value has changed.

    Running the following query demonstrates that each property has only been inserted once:

    SELECT * FROM properties ORDER BY property_name;
    
    id created_at updated_at property_name
    3 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04 prop1
    1 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04 prop2
    2 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04 prop3
    4 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04 prop4

    Samples along with their associated property name can be retrieved with the following query:

    SELECT
      tsn.sample_timestamp,
      tsn.property_id,
      p.property_name,
      tsn.sample_value,
      tsn.created_at,
      tsn.updated_at
    FROM
      time_series_numerical tsn
      JOIN properties p ON p.id = tsn.property_id;
    
    sample_timestamp property_id property_name sample_value created_at updated_at
    2024-01-01 00:00:00-05 3 prop1 1 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04
    2024-01-01 01:00:00-05 3 prop1 2 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04
    2024-01-01 00:00:00-05 1 prop2 3 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04
    2024-01-01 00:00:00-05 4 prop4 4 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04
    2024-01-01 00:00:00-05 2 prop3 5 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04

    To demonstrate the improved storage efficiency, run the following block to generate synthetic data useful for exploring the effects of different indexing options and system settings (on my 2017 MacBook Pro, this took ~1 hour to complete):

    DO $BLOCK$
    <<local>>
    DECLARE
      num_properties CONSTANT INTEGER := 2670;
      property_name_length CONSTANT INTEGER := 50;
      sampling_interval CONSTANT INTERVAL := INTERVAL '15' SECOND;
      missed_sample_probability CONSTANT DOUBLE PRECISION := 0.006;
      sample_date_range CONSTANT TSTZRANGE :=  TSTZRANGE('2024-09-01'::TIMESTAMP AT TIME ZONE 'UTC',
                                                         '2024-09-15'::TIMESTAMP AT TIME ZONE 'UTC');
    BEGIN
      TRUNCATE TABLE time_series_numerical;
      TRUNCATE TABLE properties CASCADE;
      PERFORM SETVAL('properties_id_seq', 1, FALSE);
    
      INSERT INTO properties(property_name)
      SELECT SUBSTRING(REPEAT(ENCODE(UUID_SEND(GEN_RANDOM_UUID()), 'HEX'),
                              CEIL(local.property_name_length / 16.0)::INTEGER),
                       local.property_name_length)
        FROM GENERATE_SERIES(1, local.num_properties) gs(n);
    
      INSERT INTO time_series_numerical(sample_timestamp, property_id, sample_value)
      SELECT sample_times.sample_timestamp, props.id, RANDOM() * 10000
        FROM GENERATE_SERIES(1, local.num_properties) props(id)
        CROSS JOIN GENERATE_SERIES(LOWER(local.sample_date_range),
                                   UPPER(local.sample_date_range) - INTERVAL '0.000001' SECOND,
                                   local.sampling_interval) sample_times(sample_timestamp)
        WHERE random() > local.missed_sample_probability;
    END;
    $BLOCK$ language plpgsql;
    

    This resulted in 2670 rows in properties using ~352 KB for table data and ~440 KB for indexes and 214,017,381 rows in time_series_numerical using ~13.61 GB for table data and ~8.9 GB for indexes (removing the audit columns, created_at and updated_at, reduces the table size by ~3.2 GB).
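Dividing those figures by the row count gives the per-row footprint of the revised schema, which can be compared against the roughly 226-256 total bytes per row of the original tables:

```sql
-- Per-row footprint of the revised time_series_numerical,
-- using the sizes reported from the synthetic-data run above:
SELECT (13.61 * 1024^3) / 214017381 AS table_bytes_per_row,  -- roughly 68 bytes
       (8.9  * 1024^3) / 214017381 AS index_bytes_per_row;   -- roughly 45 bytes
```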

    Two main factors contributed to the excessively long query time with the OP's original table structure: inefficient index scanning caused by index column order, and excessive block reads caused by low tuple density. Querying for an individual property would be considerably more efficient if the property column appeared first in the index definition, since that would reduce the number of block reads required to access all of the related samples. Because the property names are very long, they require significantly more bytes than an integer ID reference would, which means significantly more block reads (in the OP's case, an INTEGER reference would reduce the key width, and thus the block reads, by a factor of ~16, and a BIGINT by a factor of ~8). With the above table structure, query performance improves by several orders of magnitude, as demonstrated by the following:

    EXPLAIN(ANALYZE,BUFFERS,VERBOSE)
    SELECT
      tsn.sample_timestamp,
      tsn.property_id,
      p.property_name,
      tsn.sample_value,
      tsn.created_at,
      tsn.updated_at
    FROM
      time_series_numerical tsn
      JOIN properties p ON p.id = tsn.property_id
    WHERE p.property_name = '80c2bc53040057ac1fca90fa61a492e980c2bc53040057ac1fca90fa61a492e980c2bc53040057a'
      AND tsn.sample_timestamp >= '2024-09-01'::TIMESTAMP
      AND tsn.sample_timestamp < '2024-09-02'::TIMESTAMP;
    
    QUERY PLAN
    Nested Loop  (cost=196.53..21870.23 rows=5622 width=120) (actual time=3.191..11.325 rows=5723 loops=1)
      Output: tsn.sample_timestamp, tsn.property_id, p.property_name, tsn.sample_value, tsn.created_at, tsn.updated_at
      Buffers: shared hit=5768 read=2
      ->  Index Scan using properties_property_name_key on public.properties p  (cost=0.28..8.30 rows=1 width=88) (actual time=0.637..0.641 rows=1 loops=1)
            Output: p.id, p.created_at, p.updated_at, p.property_name
            Index Cond: (p.property_name = '80c2bc53040057ac1fca90fa61a492e980c2bc53040057ac1fca90fa61a492e980c2bc53040057a'::text)
            Buffers: shared hit=1 read=2
      ->  Bitmap Heap Scan on public.time_series_numerical tsn  (cost=196.25..21805.71 rows=5622 width=40) (actual time=2.544..7.690 rows=5723 loops=1)
            Output: tsn.sample_timestamp, tsn.property_id, tsn.created_at, tsn.updated_at, tsn.sample_value
            Recheck Cond: ((tsn.property_id = p.id) AND (tsn.sample_timestamp >= '2024-09-01 00:00:00'::timestamp without time zone) AND (tsn.sample_timestamp < '2024-09-02 00:00:00'::timestamp without time zone))
            Heap Blocks: exact=5723
            Buffers: shared hit=5767
            ->  Bitmap Index Scan on time_series_numerical_pk  (cost=0.00..194.84 rows=5622 width=0) (actual time=1.574..1.575 rows=5723 loops=1)
                  Index Cond: ((tsn.property_id = p.id) AND (tsn.sample_timestamp >= '2024-09-01 00:00:00'::timestamp without time zone) AND (tsn.sample_timestamp < '2024-09-02 00:00:00'::timestamp without time zone))
                  Buffers: shared hit=44
    Planning:
      Buffers: shared hit=78 read=11
    Planning Time: 9.076 ms
    Execution Time: 11.783 ms