Search code examples
sqlpandasgroupingpandasql

Under the impression that these two SQL queries would give the same output, yet they have wildly different results


I am using pandasql. The first turns values as expected, but the second returns things that shouldn't even exist. However, I would expect them to return the same value. The only difference, to my eye, is that in the first one, the grouping/sum occurs within the subqueries, while it occurs outside of them in the second one. What am I missing? Thank you for the help! (Outputs at bottom)

First query (the correct one)

SELECT a.'Name', a.Q1, b.Q2, (a.Q1 + b.Q2) AS Total
FROM
    (SELECT c.'Name', SUM(c.'Paid Amount') AS Q1
     FROM some_data AS c
     WHERE c.'Quarter' = 'Q1'
     GROUP BY c.'Name') AS a
JOIN
    (SELECT d.'Name', SUM(d.'Paid Amount') AS Q2
     FROM some_data AS d
     WHERE d.'Quarter' = 'Q2'
     GROUP BY d.'Name') AS b
ON a.'Name' = b.'Name'
ORDER BY Total DESC
LIMIT 5;

Second query (the bad one)

SELECT a.'Name' as Label, SUM(a.'Paid Amount') AS Q1, SUM(b.'Paid Amount') AS Q2, (SUM(a.'Paid Amount') + SUM(b.'Paid Amount')) as Total
FROM 
    (SELECT c.'Name', c.'Paid Amount'
     FROM some_data AS c
     WHERE c.'Quarter' = 'Q1') AS a
JOIN
    (SELECT c.'Name', c.'Paid Amount'
     FROM some_data AS c
     WHERE c.'Quarter' = 'Q2') AS b
ON a.'Name' = b.'Name'
GROUP BY Label
ORDER BY Total DESC
LIMIT 5;

I threw some random data together to demonstrate the problem.

Output from the first query (expected)

Output from the first query (expected)

Output from second query (problematic)

Output from second query (problematic)


Solution

  • This is what I call wishful coding.

    I hope you realize that doing the aggregation before joining produces the correct answer.

    The issue is that JOIN can both multiply the number of rows and remove rows. In your case, the issue is that one or both tables have multiple rows for name and this multiplies the number of rows. The SUM() just adds up all the values produced as a results of the JOIN.

    Note: Conditional aggregation is a much simpler way to write the query:

    SELECT c.Name,
           SUM(CASE WHEN c.Quarter = 'Q1' THEN c.PaidAmount END) AS Q1
           SUM(CASE WHEN c.Quarter = 'Q2' THEN c.PaidAmount END) AS Q2
    FROM some_data AS c
    WHERE c.Quarter IN ('Q1', 'Q2')
    GROUP BY c.Name