I have looked over this about 4 times and am still perplexed with these results.
Take a look at the following (which I originally posted here)
-- Some output omitted
DROP TABLE IF EXISTS dim_calendar CASCADE;
CREATE TABLE dim_calendar (
id SMALLSERIAL PRIMARY KEY,
day_id DATE NOT NULL,
year SMALLINT NOT NULL, -- 2000 to 2024
month SMALLINT NOT NULL, -- 1 to 12
day SMALLINT NOT NULL, -- 1 to 31
quarter SMALLINT NOT NULL, -- 1 to 4
day_of_week SMALLINT NOT NULL, -- 0 () to 6 ()
day_of_year SMALLINT NOT NULL, -- 1 to 366
week_of_year SMALLINT NOT NULL, -- 1 to 53
CONSTRAINT con_month CHECK (month >= 1 AND month <= 31),
CONSTRAINT con_day_of_year CHECK (day_of_year >= 1 AND day_of_year <= 366), -- 366 allows for leap years
CONSTRAINT con_week_of_year CHECK (week_of_year >= 1 AND week_of_year <= 53),
UNIQUE(day_id)
);
INSERT INTO dim_calendar (day_id, year, month, day, quarter, day_of_week, day_of_year, week_of_year) (
SELECT ts,
EXTRACT(YEAR FROM ts),
EXTRACT(MONTH FROM ts),
EXTRACT(DAY FROM ts),
EXTRACT(QUARTER FROM ts),
EXTRACT(DOW FROM ts),
EXTRACT(DOY FROM ts),
EXTRACT(WEEK FROM ts)
FROM generate_series('2000-01-01'::timestamp, '2024-01-01', '1day'::interval) AS t(ts)
);
/* ==> [ INSERT 0 8767 ] */
DROP TABLE IF EXISTS just_dates CASCADE;
DROP TABLE IF EXISTS just_date_ids CASCADE;
CREATE TABLE just_dates AS
SELECT a_date AS some_date
FROM some_table;
/* ==> [ SELECT 769411 ] */
CREATE TABLE just_date_ids AS
SELECT d.id
FROM just_dates jd
INNER JOIN dim_calendar d
ON d.day_id = jd.some_date;
/* ==> [ SELECT 769411 ] */
ALTER TABLE just_date_ids ADD CONSTRAINT jdfk FOREIGN KEY (id) REFERENCES dim_calendar (id);
pocket=# SELECT pg_size_pretty(pg_relation_size('dim_calendar'));
pg_size_pretty
----------------
448 kB
(1 row)
pocket=# SELECT pg_size_pretty(pg_relation_size('just_dates'));
pg_size_pretty
----------------
27 MB
(1 row)
pocket=# SELECT pg_size_pretty(pg_relation_size('just_date_ids'));
pg_size_pretty
----------------
27 MB
(1 row)
Why is a table consisting of a bunch of smallints the same size as a table consisting of a bunch of dates? And I should mention that before, when dim_calendar.id
was a normal SERIAL
, it gave the same 27MB
result.
Also, and more importantly -- WHY does a table with 769411
records with a single smallint field have a size of 27MB
, which is > 32bytes/record
???
P.S. Yes, I will have billions (or at a minimum hundreds of millions) of records, and am trying to add performance and space optimizations wherever possible.
This might have something to do with it, so throwing it out there --
pocket=# select count(id) from just_date_ids group by id;
count
--------
409752
359659
(2 rows)
In tables with one or two columns, the biggest part of the size is always the Tuple Header.
Have a look here http://www.postgresql.org/docs/current/interactive/storage-page-layout.html, it explains how the data is stored. I'm quoting the part of the above page that is most relevant with your question
All table rows are structured in the same way. There is a fixed-size header (occupying 23 bytes on most machines), followed by an optional null bitmap, an optional object ID field, and the user data.
This mostly explains the question
WHY does a table with 769411 records with a single smallint field have a size of 27MB, which is > 32bytes/record???
The other part of your question has to do with the byte alignment of postgres data. Smallints are aligned in 2-byte offsets, but ints (and dates of course... date
is an int4
after all) are aligned in 4 bytes offsets. So the order in which the table columns are devlared plays a significant role.
Having a table with smallint, date, smallint needs 12 bytes for user data (not counting the overhead), while declaring smallint, smallint, date only will need 8 bytes. See a great (and surprisingly not accepted) answer here Calculating and saving space in PostgreSQL