Tags: postgresql, query-optimization, query-performance

Optimizing a query that compares a table to itself with millions of rows


I could use some help optimizing a query that compares rows in a single table with millions of entries. Here's the table's definition:

CREATE TABLE IF NOT EXISTS data.row_check (
    id         uuid NOT NULL DEFAULT NULL,
    version    int8 NOT NULL DEFAULT NULL,
    row_hash   int8 NOT NULL DEFAULT NULL,
    table_name text NOT NULL DEFAULT NULL,

CONSTRAINT row_check_pkey
    PRIMARY KEY (id, version)
);

I'm reworking our push code and have a test bed with millions of records across about 20 tables. I run my tests, get the row counts, and can spot when some of my insert code has changed. The next step is to checksum each row, and then compare the rows for differences between versions of my code. Something like this:

-- Run my test of "version 0" of the push code, the base code I'm refactoring.  
-- Insert the ID and checksum for each pushed row.
INSERT INTO row_check (id,version,row_hash,table_name)
            SELECT id, 0, hashtext(record_changes_log::text),'record_changes_log' 
            FROM record_changes_log

            ON CONFLICT ON CONSTRAINT row_check_pkey DO UPDATE SET
                row_hash   = EXCLUDED.row_hash,
                table_name = EXCLUDED.table_name;

truncate table record_changes_log;

-- Run my test of "version 1" of the push code, the new code I'm validating.
-- Insert the ID and checksum for each pushed row.

INSERT INTO row_check (id,version,row_hash,table_name)
            SELECT id, 1, hashtext(record_changes_log::text),'record_changes_log' 
            FROM record_changes_log

            ON CONFLICT ON CONSTRAINT row_check_pkey DO UPDATE SET
                row_hash   = EXCLUDED.row_hash,
                table_name = EXCLUDED.table_name;

That gets two rows in row_check for every row in record_changes_log, or any other table I'm checking. For the two runs of record_changes_log, I end up with more than 8.6M rows in row_check. They look like this:

id                                      version row_hash    table_name
e6218751-ab78-4942-9734-f017839703f6    0   -142492569  record_changes_log
6c0a4111-2f52-4b8b-bfb6-e608087ea9c1    0   -1917959999 record_changes_log
7fac6424-9469-4d98-b887-cd04fee5377d    0   -323725113  record_changes_log
1935590c-8d22-4baf-85ba-00b563022983    0   -1428730186 record_changes_log
2e5488b6-5b97-4755-8a46-6a46317c1ae2    0   -1631086027 record_changes_log
7a645ffd-31c5-4000-ab66-a565e6dad7e0    0   1857654119  record_changes_log

I asked yesterday for some help on the comparison query, and it led to this:

 select v0.table_name,
        v0.id,
        v0.row_hash as v0,
        v1.row_hash as v1   

   from row_check v0 
   join row_check v1 on v0.id = v1.id  and
        v0.version = 0 and
        v1.version  = 1 and
        v0.row_hash <> v1.row_hash

That works, but now I'm hoping to optimize it a bit. As an experiment, I clustered the data on version and then built a BRIN index, like this:

drop index if exists row_check_version_btree;
create index row_check_version_btree
          on row_check
        using btree(version);

cluster row_check using row_check_version_btree;    
drop index row_check_version_btree; -- Eh? I want to see how the BRIN performs.

drop index if exists row_check_version_brin;
create index row_check_version_brin
          on row_check
        using brin(version);

vacuum analyze row_check;       

I ran the query through explain analyze and got this:

Merge Join  (cost=1.12..559750.04 rows=4437567 width=51) (actual time=1511.987..14884.045 rows=10 loops=1)
  Output: v0.table_name, v0.id, v0.row_hash, v1.row_hash
  Inner Unique: true
  Merge Cond: (v0.id = v1.id)
  Join Filter: (v0.row_hash <> v1.row_hash)
  Rows Removed by Join Filter: 4318290
  Buffers: shared hit=8679005 read=42511
  ->  Index Scan using row_check_pkey on ascendco.row_check v0  (cost=0.56..239156.79 rows=4252416 width=43) (actual time=0.032..5548.180 rows=4318300 loops=1)
        Output: v0.id, v0.version, v0.row_hash, v0.table_name
        Index Cond: (v0.version = 0)
        Buffers: shared hit=4360752
  ->  Index Scan using row_check_pkey on ascendco.row_check v1  (cost=0.56..240475.33 rows=4384270 width=24) (actual time=0.031..6070.790 rows=4318300 loops=1)
        Output: v1.id, v1.version, v1.row_hash, v1.table_name
        Index Cond: (v1.version = 1)
        Buffers: shared hit=4318253 read=42511
Planning Time: 1.073 ms
Execution Time: 14884.121 ms

I couldn't really get the right idea from that output, so I ran it again with JSON output and fed the results into this wonderful plan visualizer:

http://tatiyants.com/pev/#/plans

(Image: query plan node map from the visualizer)

The tips there are right: the top node's estimate is bad. The actual result is 10 rows, but the estimate is about 4.4 million rows (4,437,567).

I'm hoping to learn more about optimizing this kind of thing, and this query seems like a good opportunity. I have a lot of notions about what might help:

-- CREATE STATISTICS? (A rough sketch of what I mean is after this list.)
-- Rework the query to move the where comparison?
-- Use a better index? I did try a GIN index and a straight B-tree on version, but neither was superior.
-- Rework the row_check format to move the two hashes into the same row instead of splitting them over two rows, compare on insert/update, flag non-matches, and add a partial index for the non-matching values.
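
For the CREATE STATISTICS idea above, what I have in mind is extended statistics over the columns the planner seems to misestimate, something like the sketch below. The statistics name is just a placeholder, and I don't know yet whether it would actually improve the join estimate:

-- Rough sketch only: extended statistics on version and row_hash,
-- followed by an ANALYZE so the planner can pick them up.
CREATE STATISTICS row_check_version_hash_stats (ndistinct, dependencies)
    ON version, row_hash FROM row_check;

ANALYZE row_check;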

Granted, it's funny to even try to index something that has only two values (0 and 1 in the case above), so there's that. In fact, is there any sort of clever trick for Booleans? I'll always be comparing exactly two versions, "old" and "new", which I can encode however works best. I understand that Postgres only builds bitmaps internally, during bitmap index scans, and does not have a persistent bitmap index type. Would some kind of INTERSECT help? I don't know how Postgres implements set operations internally.
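
To illustrate the set-math idea, I suppose the comparison could also be phrased with EXCEPT, something like the sketch below. It isn't quite equivalent to the join above, since it would also return ids that exist only in version 1, and I haven't checked whether the planner does anything smarter with it:

-- Rough sketch only: (id, row_hash) pairs present in version 1 that
-- don't appear in version 0, i.e. changed or newly added rows.
SELECT id, row_hash
  FROM row_check
 WHERE version = 1
EXCEPT
SELECT id, row_hash
  FROM row_check
 WHERE version = 0;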

Thanks for any suggestions on how to rethink this data or the query to make it faster for comparisons involving millions, or tens of millions, of rows.


Solution

  • I'm going to add an answer to my own question, but am still interested in what anyone else has to say. In the process of writing out my original question, I realized that maybe a redesign is in order. This hinges on my plan to only ever compare two versions at a time. That's a good solution here, but there are other cases where it wouldn't work. Anyway, here's a slightly different table design that folds the two results into a single row:

    DROP TABLE IF EXISTS data.row_compare;
    CREATE TABLE IF NOT EXISTS data.row_compare (
        id           uuid NOT NULL DEFAULT NULL,
        hash_1       int8,    -- Want NULL to defer calculating hash comparison until after both hashes are entered.
        hash_2       int8,    -- Ditto
        hashes_match boolean, -- Likewise 
        table_name   text NOT NULL DEFAULT NULL,
    
    CONSTRAINT row_compare_pkey
        PRIMARY KEY (id)
    );
    

    The following partial index should, hopefully, be very small, as I shouldn't have any non-matching entries:

    CREATE INDEX row_compare_fail ON row_compare (hashes_match)
        WHERE hashes_match = false;
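
    As a quick sanity check on that assumption, the index size can be inspected once the data is loaded. This is just a size probe, not part of the comparison itself:

    -- The partial index should stay tiny if nearly all hashes match.
    SELECT pg_size_pretty(pg_relation_size('row_compare_fail'));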
    

    The trigger function below computes hashes_match once hash_1 and hash_2 are both provided:

    -- Run this as a BEFORE INSERT OR UPDATE row trigger.
    CREATE OR REPLACE FUNCTION data.on_upsert_row_compare()
      RETURNS trigger AS

    $BODY$
    BEGIN

        IF  NEW.hash_1 IS NULL OR
            NEW.hash_2 IS NULL THEN
            RETURN NEW; -- Don't do the comparison yet; one of the hashes hasn't been populated.

        ELSE -- Do the comparison. The point of this is to avoid constantly thrashing the partial index.
            NEW.hashes_match := NEW.hash_1 = NEW.hash_2;
            RETURN NEW;     -- important!
        END IF;
    END;

    $BODY$
    LANGUAGE plpgsql;
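
    The function still needs to be attached to the table as a BEFORE row trigger; the wiring looks something like this (the trigger name is just a placeholder):

    -- Placeholder trigger definition; EXECUTE FUNCTION needs Postgres 11+,
    -- use EXECUTE PROCEDURE on older versions.
    CREATE TRIGGER row_compare_before_upsert
        BEFORE INSERT OR UPDATE ON data.row_compare
        FOR EACH ROW
        EXECUTE FUNCTION data.on_upsert_row_compare();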
    

    This now adds 4.3M rows instead of 8.6M rows:

    -- Add the first set of results and build out the row_compare records.
    INSERT INTO row_compare (id,hash_1,table_name)
                SELECT id, hashtext(record_changes_log::text),'record_changes_log'
                FROM record_changes_log
    
                ON CONFLICT ON CONSTRAINT row_compare_pkey DO UPDATE SET
                    hash_1   = EXCLUDED.hash_1,
                    table_name = EXCLUDED.table_name;
    
    -- I'll truncate the record_changes_log and push my sample data again here.
    
    -- Add the second set of results and update the row compare records.
    -- This time, the hash is going into the hash_2 field for comparison
    INSERT INTO row_compare (id,hash_2,table_name)
                SELECT id, hashtext(record_changes_log::text),'record_changes_log'
                FROM record_changes_log
    
                ON CONFLICT ON CONSTRAINT row_compare_pkey DO UPDATE SET
                    hash_2   = EXCLUDED.hash_2,
                    table_name = EXCLUDED.table_name;
    

    And now the results are simple to find:

    select * from row_compare where hashes_match = false;
    

    This changes the query time from around 17 seconds to around 24 milliseconds.