I need to join 2 tables in Postgres on multiple columns. The columns have a different type in each table.
This is not a question how to join the tables (this has been asked and answered several times), but how to join them efficiently, with full usage of the indexes. The tables have millions of records.
If I am to design a new table with columns always having int
value, but on which I need to JOIN
to another already existing old table with text
values, what would lead to better performance:
int
and sacrificing the performance of the JOIN
, ortext
to avoid casting and have the JOIN
"direct", and sacrificing the size of the columns and the performance of the indexes (I assume that an index on int
is much more efficient than an index on text
).CREATE TABLE table_new
(
id1 int NOT NULL,
id2 int NOT NULL,
-- other columns
);
CREATE INDEX idx_new_id ON table_new (id1, id2);
CREATE TABLE table_old
(
id1 text NOT NULL,
id2 text NOT NULL,
-- other columns
);
CREATE INDEX idx_old_id ON table_old (id1, id2);
I need to join table_old
and table_new
on id1
and id2
. I know that I can use various forms of cast (as described e.g. Join two tables in postgresql if type of columns is different).
What I effectively need to achieve is an equivalent of
SELECT * FROM table_old
JOIN table_new
ON table_old.id1 = table_new.id1 AND table_old.id2 = table_new.id2;
However, how to do it so it uses the indexes, or how to create the indexes that the join is efficient?
I am worried that any form of casting id1
and id2
between int
and text
will destroy the performance.
To avoid the XY problem, here is the wider context:
table_old
is a legacy table which I am not allowed to change.table_new
is my new table which I am designing.table_new
the columns id1
and id2
always contain int
value, so I wanted to make the table and its indexes as small and as fast as possible, hence I designed them as int
.JOIN
the tables with different types of column, I can also accept as a solution that the columns would be text
.JOIN
.Functional indexes to the rescue!
CREATE TEMP TABLE t_old (tid text PRIMARY KEY);
CREATE TEMP TABLE t_new (iid int NOT NULL);
CREATE INDEX t_old_int_idx ON t_old ((tid::int));
CREATE INDEX t_new_int_idx ON t_new (iid);
INSERT INTO t_old SELECT i::text FROM generate_series(1,99999) i;
INSERT INTO t_new VALUES (10),(20),(30);
ANALYSE t_old;
ANALYSE t_new;
EXPLAIN ANALYSE SELECT t_new.*, t_old.* FROM t_new JOIN t_old ON t_new.iid = t_old.tid::int;
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Merge Join (cost=1.69..2.51 rows=3 width=9) (actual time=0.063..0.082 rows=3 loops=1) │
│ Merge Cond: ((t_old.tid)::integer = t_new.iid) │
│ -> Index Scan using t_old_int_idx on t_old (cost=0.29..3050.28 rows=99999 width=5) (actual time=0.016..0.028 rows=31 loops=1) │
│ -> Sort (cost=1.05..1.06 rows=3 width=4) (actual time=0.021..0.022 rows=3 loops=1) │
│ Sort Key: t_new.iid │
│ Sort Method: quicksort Memory: 25kB │
│ -> Seq Scan on t_new (cost=0.00..1.03 rows=3 width=4) (actual time=0.009..0.010 rows=3 loops=1) │
│ Planning Time: 0.485 ms │
│ Execution Time: 0.127 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
If not all the values in your old table are in fact valid integers you might need to make your functional index partial too.