Search code examples
sqlsql-serverdata-analysis

SQL Count by Year


I'm trying to write a query to compare the number of customers with visits and the number of customers who buy per year. If I do the query for the individual year without Group By, the number is correct, but I need it as a correct table in the system to be able to continue working with it.

This is the one with the incorrect results.

SELECT      
    COUNT (DISTINCT a.I_CUSTOMER_M) AS 'Visited Customer',
    COUNT (DISTINCT c.S_CUSTNO) AS 'Buying Customer',
    year ( i.D_INVOICEDATE) AS 'Year'

FROM
     sao.customer_M c
    LEFT JOIN sao.invoice_p i 
        on i.I_CUSTOMER_M = c.I_CUSTOMER_M
    LEFT  JOIN sao.ACTIVITY_P a
        on  (i.I_customer_M = a.I_CUSTOMER_M and a.I_ACTIVTYPE_C = 11 
    OR i.I_CUSTOMER_M IS NULL )

WHERE 
    i.i_invoice_p>0 AND 
    i.B_PROFORMA = 0 and
    i.B_CREDITNOTE = 0 and
    i.B_CANCEL = 0 and
    i.dt_deleted IS NULL AND
    datepart(yyyy,getdate()) - datepart(yyyy,i.D_INVOICEDATE) <= 2 

GROUP BY
    year (i.D_INVOICEDATE)
Visited Customer Buying Customer Year
507 1490 2021
509 1452 2022
438 1143 2023

And here is an example if I just make the query for one single year.

SELECT  
                COUNT (DISTINCT a.I_CUSTOMER_M) AS 'Visited Customer',
                COUNT (DISTINCT c.S_CUSTNO) AS 'Buying Customer',
                year ( i.D_INVOICEDATE) AS 'Year'
            FROM
                 sao.customer_M c 
            LEFT JOIN sao.invoice_p i
                on i.I_CUSTOMER_M = c.I_CUSTOMER_M
            LEFT  JOIN sao.ACTIVITY_P a
                on  (i.I_customer_M = a.I_CUSTOMER_M and a.I_ACTIVTYPE_C = 11
                and datepart(yyyy,getdate()) - datepart(yyyy,a.D_ACTIVITYDATE) = 1  OR i.I_CUSTOMER_M IS NULL )
            WHERE 
                i.i_invoice_p>0 AND 
                i.B_PROFORMA = 0 and
                i.B_CREDITNOTE = 0 and
                i.B_CANCEL = 0 and
                i.dt_deleted IS NULL AND
                datepart(yyyy,getdate()) - datepart(yyyy,i.D_INVOICEDATE) = 1 
                GROUP BY YEAR(i.D_INVOICEDATE)
Visited Customer Buying Customer Year
235 1452 2022

So as you can see the count of the Visited Customer has a big difference. Everything else is fine, not sure what I#m doing wrong. Can someone please help me?


Solution

  • When asking questions like this it's very helpful to provide example DDL/DML. It helps folks who want to answer your question by detailing exactly what your data looks like.

    Here's some pretty basic example DDL/DML which is hopefully enough to show where I think you've gotten hung up.

    DECLARE @Activities TABLE (ActivityID BIGINT IDENTITY, CustomerID BIGINT, ActivityDate DATETIME);
    DECLARE @Invoices   TABLE (InvoiceID BIGINT IDENTITY, CustomerID BIGINT, InvoiceDate DATETIME);
    DECLARE @Customers  TABLE (CustomerID BIGINT IDENTITY, FirstName NVARCHAR(30), LastName NVARCHAR(30));
    INSERT INTO @Customers (FirstName, LastName) VALUES
    ('Margaret', 'Ramirez'),('Brenda', 'Bailey'),('Helen', 'Richardson'),('Timothy', 'Carter'),('Jessica', 'Thomas'),('Jason', 'Diaz'),('Ryan', 'Collins'),('Christopher', 'Lee'),('Jerry', 'Foster'),('Emily', 'Scott'),('Barbara', 'Lopez'),
    ('Jennifer', 'Garcia'),('Jacob', 'Stewart'),('David', 'Rodriguez'),('Ronald', 'Gomez'),('Anna', 'Cooper'),('Catherine', 'Myers'),('Daniel', 'Thompson'),('Christine', 'Bennet'),('Cynthia', 'Reyes'),
    ('Brian', 'Hall'),('Kathleen', 'Morris'),('Anna', 'Cooper'),('Benjamin', 'Watson'),('Benjamin', 'Watson'),('Samuel', 'Chavez'),('Donald', 'Walker'),('Angela', 'Rogers'),('Dennis', 'Long'),('Elizabeth', 'Martinez');
    
    INSERT INTO @Activities (CustomerID, ActivityDate) 
    SELECT TOP 50 ROUND((30 * Rnd1 + 1), 0), DATEADD(DAY,-ROUND((1095 * Rnd2 + 0), 0),GETDATE())
      FROM (VALUES (RAND(CONVERT(VARBINARY,NEWID(),1)),RAND(CONVERT(VARBINARY,NEWID(),1)))) a (Rnd1, Rnd2)
        CROSS APPLY sys.sysobjects b
        CROSS APPLY sys.sysobjects c
    INSERT INTO @Invoices (CustomerID, InvoiceDate) 
    SELECT TOP 50 ROUND((30 * Rnd1 + 1), 0), DATEADD(DAY,-ROUND((1095 * Rnd2 + 0), 0),GETDATE())
      FROM (VALUES (RAND(CONVERT(VARBINARY,NEWID(),1)),RAND(CONVERT(VARBINARY,NEWID(),1)))) a (Rnd1, Rnd2)
        CROSS APPLY sys.sysobjects b
        CROSS APPLY sys.sysobjects c;
    

    All we're doing here is creating a few table variables, and populating them with some random data. The customers are just random pairings of first and last names, and the invoices/activities are randomly generated dates at some point between today and three years ago.

    I think you're getting incorrect values because of the relationship between the data only being valid where there's commonality, perhaps where a customer visited and purchased? If you segregate those aggregates first, and then bring them back together you should find the numbers you're looking for:

    ;WITH InvoiceCounts AS (
    SELECT DATEPART(YEAR,InvoiceDate) AS Year, COUNT(DISTINCT c.CustomerID) AS Count
      FROM @Customers c
        INNER JOIN @Invoices i
          ON c.CustomerID = i.CustomerID
     WHERE i.InvoiceID > 0
     GROUP BY DATEPART(YEAR,InvoiceDate)
    ), ActivityCounts AS (
    SELECT DATEPART(YEAR,ActivityDate) AS Year, COUNT(DISTINCT c.CustomerID) AS Count
      FROM @Customers c
        INNER JOIN @Activities a
          ON c.CustomerID = a.CustomerID
     WHERE a.CustomerID > -1
     GROUP BY DATEPART(YEAR,ActivityDate)
    ), Years AS (
    SELECT Year FROM InvoiceCounts
    UNION
    SELECT Year FROM ActivityCounts
    )
    
    SELECT a.Count AS VisitingCustomers, i.Count AS BuyingCustomers, y.Year AS Year
      FROM Years y
        LEFT OUTER JOIN InvoiceCounts i
          ON y.Year = i.Year
        LEFT OUTER JOIN ActivityCounts a
          ON y.Year = a.Year
    

    First we aggregate up the data from Invoices, and then the same for visits. The third CTE just gives us a list of years with visits or activity. Finally, we use the CTEs and join them on the year.

    You do have some work to do here, to make this applicable to your data set. You should be able to take your queries and replace the collection CTEs with appropriate queries.

    no data table as the results are random and won't match anyway ;) the WHERE for the collection CTEs serves no purpose but as an example.