Search code examples
sqldatabasejoincardinality

SQL - most efficient way to find if a pair of row does NOT exist


I can't seem to find a similar situation to mine online. I have a table for 'orders' called Order, and a table for details on those orders, called 'order detail'. The definition of a certain type of order is if it has 1 of two pairs of order details (Value-Unit pairs). So, my order detail table might look like this:

order_id | detail
---------|-------
1        | X  
1        | Y
1        | Z
2        | X
2        | Z
2        | B
3        | A
3        | Z
3        | B

The two pairs that go together are (X & Y) and (A & B). What is an efficient way of retrieving only those order_ids that DO NOT contain either one of these pairs? e.g. For the above table, I need to receive only the order_id 2.

The only solution I can come up with is essentially to use two queries and perform a self join:

select distinct o.order_id
from orders o
where o.order_id not in (
    select distinct order_id
    from order_detail od1 where od1.detail=X
    join order_detail od2 on od2.order_id = od1.order_id and od2.detail=Y
) 
and o.order_id not in (
    select distinct order_id
    from order_detail od1 where od1.detail=A
    join order_detail od2 on od2.order_id = od1.order_id and od2.detail=B
)

The problem is that performance is an issue, my order_detail table is HUGE, and I am quite inexperienced in query languages. Is there a faster way to do this with a lower cardinality? I also have zero control over the schema of the tables, so I can't change anything there.


Solution

  • First and foremost I'd like to emphasise that finding the most efficient query is a combination of a good query and a good index. Far too often I see questions here where people look for magic to happen in only one or the other.

    E.g. Of a variety of solutions, yours is the slowest (after fixing syntax errors) when there are no indexes, but is quite a bit better with an index on (detail, order_id)

    Please also note that you have the actual data and table structures. You'll need to experiment with various combinations of queries and indexes to find what works best; not least because you haven't indicated what platform you're using and results are likely to vary between platforms.

    [/ranf-off]


    Query

    Without further ado, Gordon Linoff has provided some good suggestions. There's another option likely to offer similar performance. You said you can't control the schema; but you can use a sub-query to transform the data into a 'friendlier structure'.

    Specifically, if you:

    • pivot the data so you have a row per order_id
    • and columns for each detail you want to check
    • and the intersection is a count of how many orders have that detail...

    Then your query is simply: where (x=0 or y=0) and (a=0 or b=0). The following uses SQL Server's temporary tables to demonstrate with sample data. The queries below work regardless of duplicate id, val pairs.

    /*Set up sample data*/
    declare @t table (
        id int,
        val char(1)
    )
    insert @t(id, val)
    values  (1, 'x'), (1, 'y'), (1, 'z'),
            (2, 'x'), (2, 'z'), (2, 'b'),
            (3, 'a'), (3, 'z'), (3, 'b')
    
    /*Option 1 manual pivoting*/
    select  t.id
    from    (
            select  o.id,
                    sum(case when o.val = 'a' then 1 else 0 end) as a,
                    sum(case when o.val = 'b' then 1 else 0 end) as b,
                    sum(case when o.val = 'x' then 1 else 0 end) as x,
                    sum(case when o.val = 'y' then 1 else 0 end) as y
            from    @t o
            group by o.id
            ) t
    where   (x = 0 or y = 0) and (a = 0 or b = 0)
    
    /*Option 2 using Sql Server PIVOT feature*/
    select  t.id
    from    (
            select  id ,[a],[b],[x],[y]
            from    (select id, val from @t) src
                    pivot (count(val) for val in ([a],[b],[x],[y])) pvt
            ) t
    where   (x = 0 or y = 0) and (a = 0 or b = 0)
    

    It's interesting to note that the query plans for options 1 and 2 above are slightly different. This suggests the possibility of different performance characteristics over large data sets.


    Indexes

    Note that the above will likely process the whole table. So there is little to be gained from indexes. However, if the table has "long rows", an index on only the 2 columns you're working with means that less data needs to be read from disk.

    The query structure you provided is likely to benefit from an indexes such as (detail, order_id). This is because the server can more efficiently check the NOT IN sub-query conditions. How beneficial will depend on the distribution of data in your table.

    As a side note I tested various query options including a fixed version of yours and Gordon's. (Only a small data size though.)

    • Without the above index, your query was slowest in the batch.
    • With the above index, Gordon's second query was slowest.

    Alternative Queries

    Your query (fixed):

    select distinct o.id
    from @t o
    where o.id not in (
        select  od1.id
        from    @t od1 
                inner join @t od2 on 
                    od2.id = od1.id
                and od2.val='Y'
        where   od1.val= 'X'
    ) 
    and o.id not in (
        select  od1.id
        from    @t od1 
                inner join @t od2 on 
                    od2.id = od1.id
                and od2.val='a'
        where   od1.val= 'b'
    )
    

    Mixture between Gordon's first and second query. Fixes the duplicate issue in the first and the performance in the second:

    select id
    from @t od
    group by id
    having (    sum(case when val in ('X') then 1 else 0 end) = 0
             or sum(case when val in ('Y') then 1 else 0 end) = 0
            )
        and(    sum(case when val in ('A') then 1 else 0 end) = 0
             or sum(case when val in ('B') then 1 else 0 end) = 0
            )
    

    Using INTERSECT and EXCEPT:

    select  id
    from    @t
    except
    (
        select  id
        from    @t
        where   val = 'a'
        intersect
        select  id
        from    @t
        where   val = 'b'
    )
    except
    (
        select  id
        from    @t
        where   val = 'x'
        intersect
        select  id
        from    @t
        where   val = 'y'
    )