
How can I reduce the storage space required by large indexes and large PostgreSQL tables in order to improve and ensure performance?


The Situation (updated):

I have 3 PostgreSQL tables (PostgreSQL version 14.9) that are used to store sensor data from currently 2670 data points. The data was written to the database in blocks covering 3 days each, using automatically generated INSERT statements of the form:

INSERT INTO monitoring_data.time_series_numerical (timestamp, property_id, value) 
VALUES ('timestamp', 'property_id', 'value')
...
ON CONFLICT (timestamp, property_id) DO NOTHING;

This ultimately led to the database becoming slower and slower, to the point that additional storage space had to be freed up before it could be accessed again. The disk space usage of the 3 tables:

table_name                 number_of_rows  bytes/row (total)  bytes/row (heap)  total    index    table    toast
time_series                   550,849,340  226.1              97.2              116 GB   66 GB    50 GB    -
time_series_numerical         501,039,740  242.7              108.9             113 GB   62 GB    51 GB    -
time_series_alphabetical       32,561,400  255.9              110.5             7946 MB  4515 MB  3431 MB  8192 bytes

(All three tables are in schema monitoring_data.)

The 3 tables via CREATE script from pgAdmin 4:

CREATE TABLE IF NOT EXISTS monitoring_data.time_series
(
    "timestamp" timestamp without time zone NOT NULL,
    property_id character varying(255) COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT time_series_pkey PRIMARY KEY ("timestamp", property_id),
    CONSTRAINT time_series_property_id_fkey FOREIGN KEY (property_id)
        REFERENCES monitoring_data.properties (property_id) MATCH SIMPLE
        ON UPDATE CASCADE
        ON DELETE RESTRICT
)
  • property_id is a varchar(255), e.g. c0022:93EA6CCA3B7D1829_AnalogInput1.status.voltage, which is used to identify the data point

time_series_numerical:

CREATE TABLE IF NOT EXISTS monitoring_data.time_series_numerical
(
    -- Inherited from table monitoring_data.time_series: "timestamp" timestamp without time zone NOT NULL,
    -- Inherited from table monitoring_data.time_series: property_id character varying(255) COLLATE pg_catalog."default" NOT NULL,
    value double precision NOT NULL,
    CONSTRAINT time_series_numerical_pkey PRIMARY KEY ("timestamp", property_id),
    CONSTRAINT time_series_numerical_timestamp_property_id_fkey FOREIGN KEY ("timestamp", property_id)
        REFERENCES monitoring_data.time_series ("timestamp", property_id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
)
    INHERITS (monitoring_data.time_series)

time_series_alphabetical: same as time_series_numerical, except for the value column:

value text COLLATE pg_catalog."default" NOT NULL,

To ensure that the foreign key constraint is satisfied when writing records to one of the inheriting tables, a trigger was created, together with a function that executes the following body before the INSERT of each record:

BEGIN
    INSERT INTO monitoring_data.time_series (timestamp, property_id)
    SELECT NEW.timestamp, NEW.property_id
    WHERE NOT EXISTS (
        SELECT 1 FROM monitoring_data.time_series 
        WHERE timestamp = NEW.timestamp
        AND property_id = NEW.property_id 
    );
    RETURN NEW;
END;
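For context, a complete wiring of this trigger might look like the following; only the body above comes from the original setup, while the function and trigger names here are illustrative:

```sql
-- Hypothetical wrapper for the function body quoted above;
-- function and trigger names are illustrative, not from the post.
CREATE FUNCTION monitoring_data.ensure_time_series_row()
RETURNS trigger
LANGUAGE plpgsql
AS $$
BEGIN
    INSERT INTO monitoring_data.time_series (timestamp, property_id)
    SELECT NEW.timestamp, NEW.property_id
    WHERE NOT EXISTS (
        SELECT 1 FROM monitoring_data.time_series
        WHERE timestamp = NEW.timestamp
        AND property_id = NEW.property_id
    );
    RETURN NEW;
END;
$$;

CREATE TRIGGER time_series_numerical_parent_row
    BEFORE INSERT ON monitoring_data.time_series_numerical
    FOR EACH ROW
    EXECUTE FUNCTION monitoring_data.ensure_time_series_row();
```

Note that the NOT EXISTS check is not safe under concurrent inserts; with the primary key on ("timestamp", property_id) in place, `INSERT ... ON CONFLICT (timestamp, property_id) DO NOTHING` would achieve the same effect atomically.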

The time_series table should be used to simplify access to measured values in the underlying tables, regardless of the data type, and to enable all timestamp and property_id entries to be stored together and only once.

As I have already noted, this is not working as expected: the data is stored twice (once in the time_series table and once in the table for the respective measurement type), and within the time_series table itself each entry even exists twice.

Query example:

SELECT *
FROM monitoring_data.time_series_numerical
WHERE property_id = 'c0022:9FF441820C58912B_HeatPump.configuration.flowTemperature'
AND timestamp >= '2024-09-01 00:00:00'
AND timestamp < '2024-09-02 00:00:00'
ORDER BY timestamp;

Result with EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS):

Index Scan using time_series_numerical_pkey on monitoring_data.time_series_numerical  (cost=0.57..1153041.19 rows=4516 width=72) (actual time=0.516..633099.399 rows=5725 loops=1)
  Output: "timestamp", property_id, value
  Index Cond: ((time_series_numerical."timestamp" >= '2024-09-01 00:00:00'::timestamp without time zone) AND (time_series_numerical."timestamp" < '2024-09-02 00:00:00'::timestamp without time zone) AND ((time_series_numerical.property_id)::text = 'c0022:9FF441820C58912B_HeatPump.configuration.flowTemperature'::text))
  Buffers: shared read=199057
Settings: search_path = 'monitoring_data'
Planning Time: 0.485 ms
JIT:
  Functions: 2
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 2.138 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 2.138 ms
Execution Time: 633108.973 ms

Before I begin to implement a solution for this problem, I am seeking advice here, as I am just starting to work with PostgreSQL.

Thoughts on solutions and questions:

I currently have two ideas to tackle the problem, but I would like to know what other options I have and what would make sense or be considered best practice for my use case.

  1. Delete all entries from the time_series table and remove the foreign key constraints from both inheriting tables, along with the triggers and functions. Access to the inheriting tables via the time_series table would still be possible, but significantly less storage space would be used.

  2. Use partitioning to split the data across multiple tables based on the timestamp in order to speed up queries.
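A minimal sketch of option 2 using declarative range partitioning, with monthly partitions on the timestamp (table and partition names are illustrative; partitions would need to cover the full data range):

```sql
-- Partitioned variant of time_series_numerical (illustrative names);
-- the partition key must be part of the primary key, which it is here.
CREATE TABLE monitoring_data.time_series_numerical_part (
    "timestamp" timestamp without time zone NOT NULL,
    property_id character varying(255) NOT NULL,
    value double precision NOT NULL,
    PRIMARY KEY ("timestamp", property_id)
) PARTITION BY RANGE ("timestamp");

-- One partition per month, created as data arrives:
CREATE TABLE monitoring_data.time_series_numerical_2024_09
    PARTITION OF monitoring_data.time_series_numerical_part
    FOR VALUES FROM ('2024-09-01') TO ('2024-10-01');
```

Partitioning mainly helps queries that filter on the partition key and makes bulk deletion of old data cheap (dropping a partition); it does not by itself shrink the per-row storage.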

It is still unclear to me why the indexes take up so much space and how I can reduce that.
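For a rough sense of the scale, dividing the reported index size of time_series by its row count (both figures from the statistics above) gives the average index footprint per row. The composite key itself (8-byte timestamp plus a ~62-byte property_id) accounts for most of it, with per-entry index overhead, alignment, and partially filled pages plausibly making up the rest:

```sql
-- Average index bytes per row of time_series, from the reported sizes:
SELECT 71006248960 / 550849340.0 AS index_bytes_per_row;  -- roughly 129 bytes
```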

Additional info:

  • PostgreSQL version 14.9
  • There are no additional indexes besides the primary keys; all indexes are B-tree indexes
  • There are no dead tuples; autovacuum is enabled
  • Extensions: postgis, fuzzystrmatch, uuid-ossp
  • It was decided that time_series_numerical and time_series_alphabetical should be separate tables


Solution

  • The OP's original design includes the full property name (called property_id in the post) with each sample. The example name is 62 characters long, which seems to be a reasonable estimate of the mean name length given the table statistics that were provided. Repeating the name with each sample is an extremely inefficient use of space.

    The following demonstrates an approach that reduces the required space by replacing the property name in each sample with a reference to a properties table. This approach results in an order of magnitude improvement in storage efficiency. Some column names have been changed to avoid using PostgreSQL keywords.

    Below are the table definitions (time_series_alphabetical is omitted since it is nearly identical to time_series_numerical):

    CREATE TABLE properties (
      id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
      updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
      property_name TEXT NOT NULL UNIQUE,
      CONSTRAINT properties_updated_at_ck CHECK (created_at <= updated_at)
    );
    
    CREATE TABLE time_series_numerical (
      sample_timestamp TIMESTAMPTZ,
      property_id BIGINT REFERENCES properties (id),
      created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
      updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
      sample_value DOUBLE PRECISION,
      CONSTRAINT time_series_numerical_pk PRIMARY KEY (property_id, sample_timestamp),
      CONSTRAINT time_series_numerical_updated_at_ck CHECK (created_at <= updated_at)
    );
    

    The audit columns (created_at and updated_at) can be omitted if there isn't a need or desire to track those attributes. Timestamps are declared as TIMESTAMPTZ as recommended at Date/Time storage. The primary key for time_series_numerical is (property_id, sample_timestamp) to facilitate efficient retrieval by property. If retrieval by sample_timestamp is also common, then an additional index should be added with sample_timestamp as the first attribute.
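If such timestamp-first retrieval is needed, the additional index could look like this (the index name is illustrative):

```sql
-- Secondary index for queries that filter primarily by time:
CREATE INDEX time_series_numerical_ts_idx
  ON time_series_numerical (sample_timestamp, property_id);
```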

    The following uses a single SQL statement to insert a set of samples with the ability to automatically add properties that have not previously been recorded:

    WITH
      samples AS (
        SELECT
          ts::TIMESTAMPTZ AS sample_timestamp,
          property_name,
          sample_value
        FROM
          (
            VALUES
              ('2024-01-01 00:00:00', 'prop1', 1),
              ('2024-01-01 01:00:00', 'prop1', 2),
              ('2024-01-01 00:00:00', 'prop2', 3),
              ('2024-01-01 00:00:00', 'prop4', 4),
              ('2024-01-01 00:00:00', 'prop3', 5)
          ) v (ts, property_name, sample_value)
      ),
      new_properties AS (
        INSERT INTO properties (property_name)
        SELECT DISTINCT property_name
          FROM samples
          ON CONFLICT DO NOTHING
          RETURNING id, property_name
      )
    INSERT INTO time_series_numerical AS tsn (sample_timestamp, property_id, sample_value)
    SELECT s.sample_timestamp,
           COALESCE(p.id,
                    (SELECT np.id
                       FROM new_properties np
                       WHERE np.property_name = s.property_name)) AS property_id,
           s.sample_value
      FROM samples s
      LEFT JOIN properties p ON p.property_name = s.property_name
      ON CONFLICT (sample_timestamp, property_id)
        DO UPDATE SET sample_value = EXCLUDED.sample_value,
                      updated_at = CURRENT_TIMESTAMP
             WHERE tsn.sample_value IS DISTINCT FROM EXCLUDED.sample_value;
    

    The ON CONFLICT clause updates a previously recorded sample if the value has changed.

    Running the following query demonstrates that each property has only been inserted once:

    SELECT * FROM properties ORDER BY property_name;
    
    id created_at updated_at property_name
    3 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04 prop1
    1 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04 prop2
    2 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04 prop3
    4 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04 prop4

    Samples along with their associated property name can be retrieved with the following query:

    SELECT
      tsn.sample_timestamp,
      tsn.property_id,
      p.property_name,
      tsn.sample_value,
      tsn.created_at,
      tsn.updated_at
    FROM
      time_series_numerical tsn
      JOIN properties p ON p.id = tsn.property_id;
    
    sample_timestamp property_id property_name sample_value created_at updated_at
    2024-01-01 00:00:00-05 3 prop1 1 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04
    2024-01-01 01:00:00-05 3 prop1 2 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04
    2024-01-01 00:00:00-05 1 prop2 3 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04
    2024-01-01 00:00:00-05 4 prop4 4 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04
    2024-01-01 00:00:00-05 2 prop3 5 2024-09-22 15:16:27.151381-04 2024-09-22 15:16:27.151381-04

    To demonstrate the improved storage efficiency, run the following block to generate synthetic data useful for exploring the effects of different indexing options and system settings (on my 2017 MacBook Pro, this took ~1 hour to complete):

    DO $BLOCK$
    <<local>>
    DECLARE
      num_properties CONSTANT INTEGER := 2670;
      property_name_length CONSTANT INTEGER := 50;
      sampling_interval CONSTANT INTERVAL := INTERVAL '15' SECOND;
      missed_sample_probability CONSTANT DOUBLE PRECISION := 0.006;
      sample_date_range CONSTANT TSTZRANGE :=  TSTZRANGE('2024-09-01'::TIMESTAMP AT TIME ZONE 'UTC',
                                                         '2024-09-15'::TIMESTAMP AT TIME ZONE 'UTC');
    BEGIN
      TRUNCATE TABLE time_series_numerical;
      TRUNCATE TABLE properties CASCADE;
      PERFORM SETVAL('properties_id_seq', 1, FALSE);
    
      INSERT INTO properties(property_name)
      SELECT SUBSTRING(REPEAT(ENCODE(UUID_SEND(GEN_RANDOM_UUID()), 'HEX'),
                              CEIL(local.property_name_length / 16.0)::INTEGER),
                       local.property_name_length)
        FROM GENERATE_SERIES(1, local.num_properties) gs(n);
    
      INSERT INTO time_series_numerical(sample_timestamp, property_id, sample_value)
      SELECT sample_times.sample_timestamp, props.id, RANDOM() * 10000
        FROM GENERATE_SERIES(1, local.num_properties) props(id)
        CROSS JOIN GENERATE_SERIES(LOWER(local.sample_date_range),
                                   UPPER(local.sample_date_range) - INTERVAL '0.000001' SECOND,
                                   local.sampling_interval) sample_times(sample_timestamp)
        WHERE random() > local.missed_sample_probability;
    END;
    $BLOCK$ language plpgsql;
    

    This resulted in 2670 rows in properties using ~352 KB for table data and ~440 KB for indexes and 214,017,381 rows in time_series_numerical using ~13.61 GB for table data and ~8.9 GB for indexes (removing the audit columns, created_at and updated_at, reduces the table size by ~3.2 GB).
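Dividing those figures by the row count gives the per-row footprint of the revised schema, which can be compared against the roughly 226-256 total bytes per row of the original tables:

```sql
-- Per-row footprint of the revised time_series_numerical,
-- using the sizes reported from the synthetic-data run above:
SELECT (13.61 * 1024^3) / 214017381 AS table_bytes_per_row,  -- roughly 68 bytes
       (8.9  * 1024^3) / 214017381 AS index_bytes_per_row;   -- roughly 45 bytes
```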

    Two main factors contributed to the excessively long query time with the OP's original table structure: inefficient index scanning caused by index column order, and excessive block reads caused by low tuple density. Querying for an individual property would be considerably more efficient if the property column appeared first in the index definition, since that would reduce the number of block reads required to access all of the related samples. Because the property names are very long, they require significantly more bytes than an integer ID reference would, which means significantly more block reads (in the OP's case, an INTEGER reference would reduce the key width, and thus the block reads, by a factor of ~16, and a BIGINT by a factor of ~8). With the above table structure, query performance improves by several orders of magnitude, as demonstrated by the following:

    EXPLAIN(ANALYZE,BUFFERS,VERBOSE)
    SELECT
      tsn.sample_timestamp,
      tsn.property_id,
      p.property_name,
      tsn.sample_value,
      tsn.created_at,
      tsn.updated_at
    FROM
      time_series_numerical tsn
      JOIN properties p ON p.id = tsn.property_id
    WHERE p.property_name = '80c2bc53040057ac1fca90fa61a492e980c2bc53040057ac1fca90fa61a492e980c2bc53040057a'
      AND tsn.sample_timestamp >= '2024-09-01'::TIMESTAMP
      AND tsn.sample_timestamp < '2024-09-02'::TIMESTAMP;
    
    QUERY PLAN
    Nested Loop  (cost=196.53..21870.23 rows=5622 width=120) (actual time=3.191..11.325 rows=5723 loops=1)
      Output: tsn.sample_timestamp, tsn.property_id, p.property_name, tsn.sample_value, tsn.created_at, tsn.updated_at
      Buffers: shared hit=5768 read=2
      ->  Index Scan using properties_property_name_key on public.properties p  (cost=0.28..8.30 rows=1 width=88) (actual time=0.637..0.641 rows=1 loops=1)
            Output: p.id, p.created_at, p.updated_at, p.property_name
            Index Cond: (p.property_name = '80c2bc53040057ac1fca90fa61a492e980c2bc53040057ac1fca90fa61a492e980c2bc53040057a'::text)
            Buffers: shared hit=1 read=2
      ->  Bitmap Heap Scan on public.time_series_numerical tsn  (cost=196.25..21805.71 rows=5622 width=40) (actual time=2.544..7.690 rows=5723 loops=1)
            Output: tsn.sample_timestamp, tsn.property_id, tsn.created_at, tsn.updated_at, tsn.sample_value
            Recheck Cond: ((tsn.property_id = p.id) AND (tsn.sample_timestamp >= '2024-09-01 00:00:00'::timestamp without time zone) AND (tsn.sample_timestamp < '2024-09-02 00:00:00'::timestamp without time zone))
            Heap Blocks: exact=5723
            Buffers: shared hit=5767
            ->  Bitmap Index Scan on time_series_numerical_pk  (cost=0.00..194.84 rows=5622 width=0) (actual time=1.574..1.575 rows=5723 loops=1)
                  Index Cond: ((tsn.property_id = p.id) AND (tsn.sample_timestamp >= '2024-09-01 00:00:00'::timestamp without time zone) AND (tsn.sample_timestamp < '2024-09-02 00:00:00'::timestamp without time zone))
                  Buffers: shared hit=44
    Planning:
      Buffers: shared hit=78 read=11
    Planning Time: 9.076 ms
    Execution Time: 11.783 ms