Search code examples
sqlsql-serverjoinsum

Join several tables together without creating duplicates messing with count and sum functions?


The database:

Sales.Customers(CustomerID ,CustomerName) 
Sales.Orders(OrderID, CustomerID) 
Sales.OrderLines(OrderID, Quantity, UnitPrice) 
Sales.Invoices(InvoiceID, OrderID, CustomerID) 
Sales.InvoiceLines(InvoiceID, Quantity, UnitPrice)

The goal is to generate the following output:

column1: list CustomerID (only orders converted into invoices) 
Column2: list of CustomerName 
Column3: number of orders by customer ID (only orders converted into invoices) 
Column4: number of invoices by customer ID 
Column5: sum (Quantity*UnitPrice) of orders by CustomerID (only orders converted into invoices) 
Column6: sum  (Quantity*UnitPrice) of invoices by Customer ID 
Column7: the difference between column 4 and column 5

I managed to join the first 4 tables and get the expected results

Select
c.CustomerID
, c.CustomerName
, COUNT(DISTINCT(o.OrderID)) AS [number orders]
, COUNT(DISTINCT(i.InvoiceID)) AS [number invoices]
, SUM(il.Quantity * il.UnitPrice) AS [TOTAL invoices]

FROM Sales.Invoices AS i
INNER JOIN Sales.InvoiceLines AS il
ON il.InvoiceID = i.InvoiceID
INNER JOIN Sales.Orders AS o
ON o.orderID = i.OrderID    
INNER JOIN Sales.Customers AS c
ON c.CustomerID = i.CustomerID

GROUP BY c.CustomerID, c.CustomerName
ORDER BY c.CustomerID

But once I joined the last one, COUNT columns were still correct but both SUM columns were not returning expected numbers (1.6x from the expected SUM).

The query returning incorrect SUM:

Select 
        c.CustomerID
        , c.CustomerName
        , COUNT(DISTINCT(o.OrderID)) AS [number orders]
        , COUNT(DISTINCT(i.InvoiceID)) AS [number invoices]
        , SUM(DISTINCT(il.Quantity * il.UnitPrice)) AS [TOTAL invoices]
        , SUM(DISTINCT(ol.Quantity * ol.UnitPrice)) AS [TOTAL orders] 
            
FROM Sales.Invoices AS i
        LEFT JOIN Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID 
        LEFT JOIN Sales.Orders AS o
        ON o.orderID = i.OrderID    
        LEFT JOIN Sales.OrderLines AS ol
        ON ol.OrderID = o.OrderID
        LEFT JOIN Sales. Customers AS c
        ON c.CustomerID = i.CustomerID
    
GROUP BY c.CustomerID, c.CustomerName
ORDER BY c.CustomerID

I tried different types of joins, different orders, subqueries, and division queries... But always ended up with some errors or incorrect results. Any help would be highly appreciated!


Solution

  • To troubleshoot, remove all the group by and sums, pick one order (using a where clause) and actually look at the detail data.

    This is all part of the basic problem solving process. You can't solve a problem looking at a summary you need to look at the detail.

    Firstly, you have "double counting"

    An order might have one invoice and three invoice lines. When you join those you get three records. That order might also have two order lines. When you join those three records to two order lines you get six (3x2) records. So now when you sum up lines or invoices you have too many records and the figure is overstated.

    Secondly, this is definitely incorrect

    SUM(DISTINCT(ol.Quantity * ol.UnitPrice))

    If seven completely unrelated order lines have the same value. it'll throw out six of them. Which is definitely incorrect.

    You can't answer the questions you need by just joining up all the tables.

    You need to do a seperate subtable (or CTE) query that summarises order value by customerid then you can join on that.

    -- this dataset is unique on customer id 
    -- so won't double count when joined to any other dataset 
    -- that is unique on customerid
    with ordersummary as (
    select o.customerid, SUM(ol.Quantity * ol.UnitPrice) as ordervalue
    from Sales.Orders AS o
    INNER JOIN Sales.OrderLines AS ol
    ON ol.OrderID = o.OrderID
    group by customerid
    ),
    -- this dataset is also unique on customer id 
    -- so won't double count when joined to any other dataset 
    -- that is unique on customerid
    invoicesummary as 
    
    (
    Select
    c.CustomerID
    , c.CustomerName
    , COUNT(DISTINCT(o.OrderID)) AS [number orders]
    , COUNT(DISTINCT(i.InvoiceID)) AS [number invoices]
    , SUM(il.Quantity * il.UnitPrice) AS [TOTAL invoices]
    
    FROM Sales.Invoices AS i
    INNER JOIN Sales.InvoiceLines AS il
    ON il.InvoiceID = i.InvoiceID
    INNER JOIN Sales.Orders AS o
    ON o.orderID = i.OrderID    
    INNER JOIN Sales.Customers AS c
    ON c.CustomerID = i.CustomerID
    GROUP BY c.CustomerID, c.CustomerName
    )
    -- now we use our CTE's
    -- and join on thoe unique customerid value. No double counting.
    SELECT
        invoicesummary.CustomerID
        , invoicesummary.CustomerName
        , invoicesummary.[number orders]
        , invoicesummary.[number invoices]
        , invoicesummary.[TOTAL invoices]
        , ordersummary.ordervalue 
    FROM invoicesummary
        INNER JOIN ordersummary 
        ON ordersummary.customerid=invoicesummary.Customerid
    ORDER BY invoicesummary.CustomerID