Search code examples
sqlpostgresqldatabase-designquery-optimization

Performance of JOIN on columns which have different type in each table


I need to join 2 tables in Postgres on multiple columns. The columns have a different type in each table.

This is not a question how to join the tables (this has been asked and answered several times), but how to join them efficiently, with full usage of the indexes. The tables have millions of records.

If I am to design a new table with columns always having int value, but on which I need to JOIN to another already existing old table with text values, what would lead to better performance:

  1. Designing the columns as int and sacrificing the performance of the JOIN, or
  2. Designing the columns as text to avoid casting and have the JOIN "direct", and sacrificing the size of the columns and the performance of the indexes (I assume that an index on int is much more efficient than an index on text).
CREATE TABLE table_new
(
    id1 int NOT NULL,
    id2 int NOT NULL,
    -- other columns
);
CREATE INDEX idx_new_id ON table_new (id1, id2);

CREATE TABLE table_old
(
    id1 text NOT NULL,
    id2 text NOT NULL,
    -- other columns
);
CREATE INDEX idx_old_id ON table_old (id1, id2);

I need to join table_old and table_new on id1 and id2. I know that I can use various forms of cast (as described e.g. Join two tables in postgresql if type of columns is different).

What I effectively need to achieve is an equivalent of

SELECT * FROM table_old
    JOIN table_new 
    ON table_old.id1 = table_new.id1 AND table_old.id2 = table_new.id2;

However, how to do it so it uses the indexes, or how to create the indexes that the join is efficient?

I am worried that any form of casting id1 and id2 between int and text will destroy the performance.

To avoid the XY problem, here is the wider context:

  1. table_old is a legacy table which I am not allowed to change.
  2. table_new is my new table which I am designing.
  3. Both the tables can have millions of records.
  4. I know that in table_new the columns id1 and id2 always contain int value, so I wanted to make the table and its indexes as small and as fast as possible, hence I designed them as int.
  5. If there is no reasonable way to effectively JOIN the tables with different types of column, I can also accept as a solution that the columns would be text.
  6. What I really need to be fast is the JOIN.

Solution

  • Functional indexes to the rescue!

    CREATE TEMP TABLE t_old (tid text PRIMARY KEY);
    CREATE TEMP TABLE t_new (iid int NOT NULL);
    CREATE INDEX t_old_int_idx ON t_old ((tid::int));
    CREATE INDEX t_new_int_idx ON t_new (iid);
    INSERT INTO t_old SELECT i::text FROM generate_series(1,99999) i;
    INSERT INTO t_new VALUES (10),(20),(30);
    ANALYSE t_old;
    ANALYSE t_new;
    EXPLAIN ANALYSE SELECT t_new.*, t_old.* FROM t_new JOIN t_old ON t_new.iid = t_old.tid::int;
    ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
    │                                                            QUERY PLAN                                                             │
    ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
    │ Merge Join  (cost=1.69..2.51 rows=3 width=9) (actual time=0.063..0.082 rows=3 loops=1)                                            │
    │   Merge Cond: ((t_old.tid)::integer = t_new.iid)                                                                                  │
    │   ->  Index Scan using t_old_int_idx on t_old  (cost=0.29..3050.28 rows=99999 width=5) (actual time=0.016..0.028 rows=31 loops=1) │
    │   ->  Sort  (cost=1.05..1.06 rows=3 width=4) (actual time=0.021..0.022 rows=3 loops=1)                                            │
    │         Sort Key: t_new.iid                                                                                                       │
    │         Sort Method: quicksort  Memory: 25kB                                                                                      │
    │         ->  Seq Scan on t_new  (cost=0.00..1.03 rows=3 width=4) (actual time=0.009..0.010 rows=3 loops=1)                         │
    │ Planning Time: 0.485 ms                                                                                                           │
    │ Execution Time: 0.127 ms                                                                                                          │
    └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
    

    If not all the values in your old table are in fact valid integers you might need to make your functional index partial too.