Search code examples
sqlpostgresqlduplicatespostgresql-13

Use DISTINCT to display unique items


I have 2 tables which I would like to query and display data differences:

CREATE TABLE order_splits_config (
    id                   INT,
    pair_id              INT
);

CREATE TABLE active_pairs (
    id                  INT,
    pair                VARCHAR(30),
    exchange_active     boolean,
    exchange_id         INT
);

INSERT INTO order_splits_config(id, pair_id)
VALUES (1, 83);
INSERT INTO order_splits_config(id, pair_id)
VALUES (2, 58);
INSERT INTO order_splits_config(id, pair_id)
VALUES (34, 34);

INSERT INTO active_pairs(id, pair, exchange_active, exchange_id)
VALUES (1, 'US/EN', true, 2);
INSERT INTO active_pairs(id, pair, exchange_active, exchange_id)
VALUES (2, 'GB/UK', true, 3);
INSERT INTO active_pairs(id, pair, exchange_active, exchange_id)
VALUES (2, 'FR/EU', true, 4);

I use this query to query the differences:

SELECT b.id, b.pair, b.exchange_id
FROM order_splits_config a
FULL OUTER JOIN active_pairs b
ON a.pair_id = b.id
WHERE a.pair_id IS NULL
OR b.id IS NULL
AND b.exchange_active = 'true';

This prints a lot of lines like this for example (data is just for example):

#,pair,id,exchange_id
1, US/EN,332,1
2, GB/UK,112,1
3, GB/UK,113,1
4, FR/EU,221,5
5, FR/EU,183,2
...

How can I use DISTINCT in order to get the result from the query unique by pair?

FIDDLE: https://www.db-fiddle.com/f/4D6VfqysPCWhQnh8zaFBps/2


Solution

  • You only select columns from table active_pairs to begin with.

    After joining on a.pair_id = b.id, the condition a.pair_id IS NULL is only true if there is no matching row in table order_splits_config AS a, which is more efficiently formulated as NOT EXISTS instead of the FULL JOIN.

    As for b.id IS NULL AND b.exchange_active = 'true': again, b.id can only be NULL for rows of a where no matching b is found - in which case b.exchange_active can never be true. So the clause effectively eliminates all rows of a that were just added by the FULL JOIN and which would show up in the result as all NULL values since there are only columns from b. In short: don't FULL JOIN to begin with.

    If active_pairs.pair is defined UNIQUE, it boils down to:

    SELECT id, pair, exchange_id
    FROM   active_pairs b
    WHERE  NOT EXISTS (SELECT FROM order_splits_config a WHERE a.pair_id = b.id);
    

    If active_pairs.pair is not defined UNIQUE, the simple solution is with DISTINCT ON:

    SELECT DISTINCT ON (pair)
           id, pair, exchange_id
    FROM   active_pairs b
    WHERE  NOT EXISTS (SELECT FROM order_splits_config a WHERE a.pair_id = b.id);
    

    db<>fiddle here

    From each set of duplicates on pair, you get an arbitrary pick.
    For a deterministic pick, define what to pick and add an ORDER BY clause accordingly. Example: to get the one with the smallest id, add:

    ...
    ORDER BY pair, id;
    

    Also adds a sort order to the so far unsorted result.

    If order_splits_config isn't trivially small, have an index on (pair_id) to make this fast.

    There may be much faster solutions for a big active_pairs table, depending on undisclosed information, mostly the cardinality of column pair (how many duplicate values).

    See: