Search code examples
sql

Why do these two queries return different values when their subquery has the same output?


I'm making my way through an SQL workbook, and I was wondering why these two queries return a different result? What are the differences in behavior of WHERE IN and inline views?

Query I wrote
SELECT SUM(invoice_total) AS sum_of_maxes
FROM invoices
WHERE invoice_total IN
    (SELECT MAX(invoice_total)
    FROM invoices
    WHERE invoice_total - credit_total - payment_total > 0
    GROUP BY vendor_id)

Whole query output:

SUM_OF_MAXES
1 52680.48

Subquery output:

MAX(INVOICE_TOTAL)
1 224
2 21842
3 90.36
4 1575
5 2433
6 17.5
7 46.21
8 662
9 9.95
10 41.8
11 503.2
12 20551.18
13 6
14 1962.13
15 2312.2
16 158
Textbook solution
SELECT SUM(invoice_max) AS sum_of_maximums
FROM (SELECT vendor_id, MAX(invoice_total) AS invoice_max
      FROM invoices
      WHERE invoice_total - credit_total - payment_total > 0
      GROUP BY vendor_id)

Whole query output:

SUM_OF_MAXIMUMS
1 52434.53

Subquery output:

VENDOR_ID INVOICE_MAX
1 37 224
2 72 21842
3 80 90.36
4 83 1575
5 86 2433
6 94 17.5
7 95 46.21
8 96 662
9 97 9.95
10 102 41.8
11 106 503.2
12 110 20551.18
13 115 6
14 121 1962.13
15 122 2312.2
16 123 158

When I ran the subqueries on their own, they return the same values for MAX(invoice_total), but running the scripts in their entirety creates two different final outputs, which surprised me. Sure the methods are different but I don't know enough to understand why the results should be. Have I done something erroneous or should I expect that subqueries behave differently under FROM and WHERE?


Solution

  • The problem you observe when executing your incorrect query is that the same invoice which you determined as highest invoice of one vendor id appears for other vendor id's where it is not the maximum. The correct query only fetches the highest invoice per vendor id and sums them, but your query sums all other occurences, too.

    You can find invoices and their vendor id's that appear multiple times with following query:

    SELECT
      invoice_total, vendor_id
    FROM invoices
    WHERE invoice_total
    IN
    (SELECT 
      invoice_total
    FROM invoices
    GROUP BY 
      invoice_total
    HAVING 
      COUNT(vendor_id) > 1)
    ORDER BY invoice_total;
    

    Or you can change your wrong query and select all columns, then sort by invoice total:

    SELECT *
    FROM invoices
    WHERE invoice_total IN
        (SELECT MAX(invoice_total)
        FROM invoices
        WHERE invoice_total - credit_total - payment_total > 0
        GROUP BY vendor_id)
    ORDER BY invoice_total;
    

    Then you will also see the "duplicated" rows that your query includes and the correct one does not.

    I created a sample fiddle based on your data where you can reproduce this.