Search code examples
sqlpostgresqlrelational-division

Checking two distinct lists with AND operation on same column


I have the following example data structure of customer that can be part of multiple groups using a junction table and data:

CREATE TABLE customer(id) AS VALUES (0),(1),(2),(3);
CREATE TABLE groups(id)   AS VALUES (1),(3),(5),(6);
CREATE TABLE customers_to_groups(customer_id, group_id) AS
  VALUES (0, 1)--customer 0 is in group (5 OR 6) AND (1 OR 3)
        ,(0, 5)--customer 0 is in group (5 OR 6) AND (1 OR 3)
        ,(1, 1)
        ,(1, 90)
        ,(2, 1)
        ,(3, 3)--customer 3 is in group (5 OR 6) AND (1 OR 3)
        ,(3, 5)--customer 3 is in group (5 OR 6) AND (1 OR 3)
        ,(3, 90);

I need to get customers that have specific groups they are part of, and I need to get a list of all customers that are part of at least 1 group in multiple lists of group. For example I want to get all customers that are in group (5 OR 6) AND (1 OR 3), so for example a customer in group 5 and 1 wold be a returned, but somebody in group 1 and 90 or just group 1 not. With the provided sample data we would get the customer of id 0 and 3 only as they conform to the given rules above.

Just doing WHERE group_id IN (5,6) AND group_id IN (1,3) does not seem to work, so I am looking for alternative.

I got this so far that works:

SELECT DISTINCT c.id
FROM customer c
INNER JOIN customers_to_groups at1 ON c.id = at1.customer_id
INNER JOIN customers_to_groups at2 ON c.id = at2.customer_id
WHERE at1.group_id IN (5, 6)
  AND at2.group_id IN (1, 3);

Expected Results:

id
0
3

Is there a way to do it that is more performant?


Solution

  • We can GROUP BY customer's id and use a HAVING clause. There we can use CASE or FILTER if your RDBMS supports it. Postgres should.

    There your conditions will be set.

    The query will be:

    SELECT c.id 
    FROM customer c
    INNER JOIN customers_to_groups ctg
      ON c.id = ctg.customer_id
      AND ctg.group_id IN (1,3,5,6)
    GROUP BY c.id
    HAVING 
      COUNT(CASE WHEN ctg.group_id IN (1,3) THEN 1 END) > 0
      AND COUNT(CASE WHEN ctg.group_id IN (5,6) THEN 1 END) > 0;
    

    or

    SELECT c.id 
    FROM customer c
    INNER JOIN customers_to_groups ctg
      ON c.id = ctg.customer_id
      AND ctg.group_id IN (1,3,5,6)
    GROUP BY c.id
    HAVING 
      COUNT(*) FILTER(WHERE ctg.group_id IN (1,3)) > 0
      AND COUNT(*) FILTER(WHERE ctg.group_id IN (5,6)) > 0;
    

    Note: Above queries assume you really need to include both customers and customers_to_groups table and join them. If you don't need to include the customers table, just remove it and select from the table customers_to_groups only to improve the performance:

    SELECT customer_id
    FROM customers_to_groups ctg
    WHERE ctg.group_id IN (1,3,5,6)
    GROUP BY customer_id
    HAVING 
      COUNT(*) FILTER(WHERE ctg.group_id IN (1,3)) > 0
      AND COUNT(*) FILTER(WHERE ctg.group_id IN (5,6)) > 0;
    

    This demo with your sample data and a lot of additional rows shows the performance differences.

    Using FILTER or CASE is far faster (as it avoids a second JOIN on the table customers_to_groups). The exact performance difference depends on the real data in your tables and which index(es) you use.