I'm making my way through an SQL workbook, and I was wondering why these two queries return a different result? What are the differences in behavior of WHERE IN
and inline views?
SELECT SUM(invoice_total) AS sum_of_maxes
FROM invoices
WHERE invoice_total IN
(SELECT MAX(invoice_total)
FROM invoices
WHERE invoice_total - credit_total - payment_total > 0
GROUP BY vendor_id)
Whole query output:
SUM_OF_MAXES | |
---|---|
1 | 52680.48 |
Subquery output:
MAX(INVOICE_TOTAL) | |
---|---|
1 | 224 |
2 | 21842 |
3 | 90.36 |
4 | 1575 |
5 | 2433 |
6 | 17.5 |
7 | 46.21 |
8 | 662 |
9 | 9.95 |
10 | 41.8 |
11 | 503.2 |
12 | 20551.18 |
13 | 6 |
14 | 1962.13 |
15 | 2312.2 |
16 | 158 |
SELECT SUM(invoice_max) AS sum_of_maximums
FROM (SELECT vendor_id, MAX(invoice_total) AS invoice_max
FROM invoices
WHERE invoice_total - credit_total - payment_total > 0
GROUP BY vendor_id)
Whole query output:
SUM_OF_MAXIMUMS | |
---|---|
1 | 52434.53 |
Subquery output:
VENDOR_ID | INVOICE_MAX | |
---|---|---|
1 | 37 | 224 |
2 | 72 | 21842 |
3 | 80 | 90.36 |
4 | 83 | 1575 |
5 | 86 | 2433 |
6 | 94 | 17.5 |
7 | 95 | 46.21 |
8 | 96 | 662 |
9 | 97 | 9.95 |
10 | 102 | 41.8 |
11 | 106 | 503.2 |
12 | 110 | 20551.18 |
13 | 115 | 6 |
14 | 121 | 1962.13 |
15 | 122 | 2312.2 |
16 | 123 | 158 |
When I ran the subqueries on their own, they return the same values for MAX(invoice_total)
, but running the scripts in their entirety creates two different final outputs, which surprised me. Sure the methods are different but I don't know enough to understand why the results should be. Have I done something erroneous or should I expect that subqueries behave differently under FROM
and WHERE
?
The problem you observe when executing your incorrect query is that the same invoice which you determined as highest invoice of one vendor id appears for other vendor id's where it is not the maximum. The correct query only fetches the highest invoice per vendor id and sums them, but your query sums all other occurences, too.
You can find invoices and their vendor id's that appear multiple times with following query:
SELECT
invoice_total, vendor_id
FROM invoices
WHERE invoice_total
IN
(SELECT
invoice_total
FROM invoices
GROUP BY
invoice_total
HAVING
COUNT(vendor_id) > 1)
ORDER BY invoice_total;
Or you can change your wrong query and select all columns, then sort by invoice total:
SELECT *
FROM invoices
WHERE invoice_total IN
(SELECT MAX(invoice_total)
FROM invoices
WHERE invoice_total - credit_total - payment_total > 0
GROUP BY vendor_id)
ORDER BY invoice_total;
Then you will also see the "duplicated" rows that your query includes and the correct one does not.
I created a sample fiddle based on your data where you can reproduce this.