timescaledb

Compressed chunks: performance and data size


I'm new to TimescaleDB and started by exploring the documentation. It's pretty clear, but it looks like I've missed something important.

I've created a table:

CREATE TABLE session_created
(
    event_timestamp timestamp without time zone NOT NULL,
    client_id integer,
    client_version text,
    identifier text,
    platform text,
    remote_addr text,
    country text,
    type smallint,
    session text
);

CREATE INDEX session_created_client_id_idx ON session_created USING btree (client_id ASC NULLS LAST);
CREATE INDEX session_created_event_timestamp_idx ON session_created USING btree (event_timestamp ASC NULLS LAST);

Then I made it a hypertable with the following compression settings:

SELECT create_hypertable('session_created','event_timestamp');

ALTER TABLE session_created SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'event_timestamp'
);

SELECT add_compression_policy('session_created', INTERVAL '1 days');

And filled the table with one million rows per day from 2021-11-09 to 2021-11-23, like this:

INSERT INTO session_created
(
    event_timestamp,
    client_id,
    client_version,
    identifier,
    platform,
    remote_addr,
    country,
    type,
    session
)
SELECT '2021-11-23 00:00:00'::timestamp + s.id * interval '85 milliseconds', 
    s.id % 500000, 
    '1.0.1234', 
    'deviceid-' || s.id % 500000,
    'Android',
    '127.0.0.' || s.id % 256,
    'RU',
    0,
    md5(random()::text || clock_timestamp()::text)::uuid
FROM generate_series(1, 1000000) AS s(id);

The target table contained three chunks, and I compressed two of them to play with the core TimescaleDB feature:

SELECT compress_chunk(chunk_name)
FROM show_chunks('session_created', older_than => INTERVAL '1 day') chunk_name;

The problem is that the compressed data took roughly two and a half times as much space as the data before compression:

before compression: 1702 MB
after compression:  4178 MB
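(For reference, per-chunk sizes like these can be cross-checked with TimescaleDB's informational function, assuming TimescaleDB 2.x:)

```sql
SELECT chunk_name,
       pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes)  AS after
FROM chunk_compression_stats('session_created');
```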

Also, querying the compressed data in an analytic manner takes more time:

select event_timestamp::date as date, count(distinct client_id) as clients
from session_created
where event_timestamp between (current_date - 7)::date and (current_date - 6)::date - interval '1 second'
group by 1

What have I missed in the documentation? What is causing these problems?


Solution

  • Typically the "time" column used to create the hypertable is not used as the segmentby column when setting up compression. The segmentby column should be one whose values repeat across many rows of the data set. For example, in a table of device readings (device_id, event_timestamp, event_id, reading), device_id is a good segmentby candidate if there are a few thousand devices and millions/billions of readings. Note that data in the segmentby column is never stored in compressed form; only the non-segmentby columns get compressed. Segmenting by event_timestamp, which is nearly unique per row, produces one compressed group per distinct value, so almost nothing actually compresses and the per-group overhead inflates the table, which explains both the size increase and the slower queries.
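As a sketch of settings that should compress this schema much better (assuming TimescaleDB 2.x; client_id is an illustrative segmentby choice, since each value repeats across many rows, and ordering by event_timestamp helps the delta encoding of timestamps):

```sql
-- Decompress any already-compressed chunks before changing the settings
-- (if_compressed => true skips chunks that are not compressed)
SELECT decompress_chunk(c, if_compressed => true)
FROM show_chunks('session_created') c;

ALTER TABLE session_created SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'client_id',            -- repeats across many rows
  timescaledb.compress_orderby = 'event_timestamp DESC'    -- timestamps compress well when ordered
);

-- Recompress the older chunks with the new settings
SELECT compress_chunk(c, if_not_compressed => true)
FROM show_chunks('session_created', older_than => INTERVAL '1 day') c;
```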