Search code examples
sqlpostgresqlrelational-division

Making simple SQL more efficient


SQL Fiddle.

I'm having a slow start to the morning. I thought there was a more efficient way to make the following query using a join, instead of two independent selects -- am I wrong?

Keep in mind that I've simplified/reduced my query into this example for SO purposes, so let me know if you have any questions as well.

SELECT DISTINCT c.* 
FROM   customers c
WHERE  c.customer_id IN (select customer_id from customers_cars where car_make = 'BMW')
  AND  c.customer_id IN (select customer_id from customers_cars where car_make = 'Ford')
;

Sample Table Schemas

-- Simple tables to demonstrate point
CREATE TABLE customers (
  customer_id serial,
  name text
  );

CREATE TABLE customers_cars (
  customer_id integer,
  car_make text
  );


-- Populate tables
INSERT INTO customers(name) VALUES
  ('Joe Dirt'),
  ('Penny Price'),
  ('Wooten Nagen'),
  ('Captain Planet')
;

INSERT INTO customers_cars(customer_id,car_make) VALUES
  (1,'BMW'),
  (1,'Merc'),
  (1,'Ford'),
  (2,'BMW'),
  (2,'BMW'),      -- Notice car_make is not unique
  (2,'Ferrari'),
  (2,'Porche'),
  (3,'BMW'),
  (3,'Ford');
-- ids 1 and 3 both have BMW and Ford

Other Expectations

  • There are ~20 car_make in the database
  • There are typically 1-3 car_make per customer_id
  • There is expected to be not more than 50 car_make assignments per customer_id (generally 20-30)
  • The query is generally only going to look for 2-3 specific car_make per customer (e.g., BMW and Ford), but not 10-20

Solution

  • You don't need to join to customers at all (given relational integrity).

    Generally, this is a case of relational division. We assembled an arsenal of techniques under this related question:

    Unique combinations

    If (customer_id, car_make) was defined unique in customers_cars, it would get much simpler:

    SELECT customer_id
    FROM   customers_cars
    WHERE  car_make IN ('BMW', 'Ford')
    GROUP  BY 1
    HAVING count(*) = 2;
    

    Combinations not unique

    Since (customer_id, car_make) is not unique, we need an extra step.

    For only a few cars, your original query is not that bad. But (especially with duplicates!) EXISTS is typically faster than IN, and we don't need the final DISTINCT:

    SELECT customer_id -- no DISTINCT needed.
    FROM   customers c
    WHERE  EXISTS (SELECT 1 FROM customers_cars WHERE customer_id = c.customer_id AND car_make = 'BMW')
    AND    EXISTS (SELECT 1 FROM customers_cars WHERE customer_id = c.customer_id AND car_make = 'Ford');
    

    Above query gets verbose and less efficient for a longer list of cars. For an arbitrary number of cars I suggest:

    SELECT customer_id
    FROM  (
       SELECT customer_id, car_make
       FROM   customers_cars
       WHERE  car_make IN ('BMW', 'Ford')
       GROUP  BY 1, 2
       ) sub
    GROUP  BY 1
    HAVING count(*) = 2;
    

    SQL Fiddle.