Search code examples
sqlsql-servergroup-bysumsql-server-2017

Select only rows which the SUM of specific column is distinct to zero with group by another colum SQL Server


I want to filter a table eliminating the rows that give zero in a sum over an "amount" column and that are grouped by an Id, for example, if I have "-10, 10 and 15" for the same Id. I want to eliminate the rows that have -10 and 10 and stay with row 15

So far I was able to filter the table using GROUP BY but I need to get the ID to do a select again on the same table, also I need the SUM split.

CREATE Table #myData 
(
    Id INT, 
    DateInit date, 
    DebtorName varchar(10), 
    Trancode varchar(10), 
    Amount VARCHAR(50), 
    DebtorID varchar(50), 
    ClientRef NVARCHAR(256)
)

INSERT INTO #myData 
VALUES
 (1, '1/1/2022', 'BURNETTE', '13 P', '-53.42', '23916762', 'T081819513107-00009') 
,(2, '1/1/2022', 'BURNETTE', '13 P', '-28.83', '23916761', 'T081819513107-00008') 
,(3, '1/1/2022', 'BURNETTE', '13 P', '-12.08', '23916764', 'T081819513107-00010') 
,(4, '1/1/2022', 'BURNETTE', '1 P', '12.08', '23916764', 'T081819513107-00010') 
,(5, '1/1/2022', 'BURNETTE', '1 P', '28.83', '23916761', 'T081819513107-00008') 
,(6, '1/1/2022', 'BURNETTE', '1 P', '53.42', '23916762', 'T081819513107-00009') 
,(7, '1/1/2022', 'BURNETTE', '2 P', '66.41', '23916762', 'T081819513107-00009') 
,(8, '1/1/2022', 'BURNETTE', '2 P', '71.10', '23916762', 'T081819513107-00009') 
,(9, '1/1/2022', 'BURNETTE', '2 P', '12.08', '23916764', 'T081819513107-00010') 

SELECT 
    DebtorID, 
    SUM(CASE 
            WHEN TRY_PARSE(A.Amount AS float) IS NULL
                THEN (-1) * CAST(SUBSTRING(A.Amount, CHARINDEX('-', A.amount) + 1, LEN(A.Amount)) AS float)
                ELSE CAST(A.Amount AS float)
        END) AS amountSum
FROM 
    #myData A
GROUP BY 
    DebtorID
HAVING 
    SUM(CASE 
            WHEN TRY_PARSE(A.Amount AS float) IS NULL
                THEN (-1) * CAST(SUBSTRING(A.Amount, CHARINDEX('-', A.Amount) + 1, LEN(A.Amount)) AS float)
                ELSE CAST(A.Amount AS float)
        END) != 0

But this query returns:

DebtorID amountSum
23916762 137.51
23916764 12.08

The ultimate goal is to filter these results and save them in another table identical to the original (#myData).

This is what I want to achieve for input #myData :

Id DateInit DebtorName Trancode amount DebtorID ClientRef
6 1/1/2022 BURNETTE 2 P 66.41 23916762 T081819513107-00009
8 1/1/2022 BURNETTE 2 p 71.10 23916762 T081819513107-00009
8 1/1/2022 BURNETTE 2 p 12.08 23916764 T081819513107-00010

I would appreciate any help, thanks!

Update 2 - new scenario

thanks for the help, but this scenario has to have the same output.

Id DateInit DebtorName Trancode amount DebtorID ClientRef
1 1/1/2022 BURNETTE 13 P -53.42 23916762 T081819513107-00009
2 1/1/2022 BURNETTE 13 P -12.08 23916764 T081819513107-00010
3 1/1/2022 BURNETTE 1 p 12.08 23916764 T081819513107-00010
4 1/1/2022 BURNETTE 1 p 28.83 23916761 T081819513107-00008
5 1/1/2022 BURNETTE 1 p 66.41 23916762 T081819513107-00009
6 1/1/2022 BURNETTE 13 p -28.83 23916761 T081819513107-00008
7 1/1/2022 BURNETTE 1 p 53.42 23916762 T081819513107-00009
8 1/1/2022 BURNETTE 1 p 71.10 23916762 T081819513107-00009
9 1/1/2022 BURNETTE 2 p 12.08 23916764 T081819513107-00010

the expected result output:

Id DateInit DebtorName Trancode amount DebtorID ClientRef
5 1/1/2022 BURNETTE 1 p 66.41 23916762 T081819513107-00009
8 1/1/2022 BURNETTE 1 p 71.10 23916762 T081819513107-00009
9 1/1/2022 BURNETTE 2 p 12.08 23916764 T081819513107-00010

Solution

  • If you need the original data set but want to include an aggregate or want to filter or sort based on an aggregate then Window Functions can help you out a lot.

    UPDATE: The nature of the question has changed slightly, however the original advice is still the same, we don't want to use GROUP BY because we lose the detail rows, if we must do this in a sing expression then Window Functions provide a simple syntax solution

    In this solution I've used a CTE to separate out the formatting of the amount column and the calculation of the aggregate, but you could have used CROSS APPLY for the formatting.

    Notice in the solution we are NOT applying a SUM at all, we are merely matching values, this means the same solution with work for non--numeric values

    In the fiddle I've tried to detail some of the steps but ultimately, we process the data in the following sequence:

    1. Format the Amount column into a numeric value
    • This is actually a redundant but simplifies other aggregate operations you might make later
    1. Identify the unique pairs of Amount and DebtorID values using ROW_NUMBER() window function
    2. Filter out the unique pairs that match
    3. Return the processed data joined back with any additional columns or references you might need.
    WITH FormattedData as (
      SELECT ID, DebtorID, COALESCE(TRY_PARSE(amount as float)
                           , (-1)*CAST(SUBSTRING(amount,CHARINDEX('-',amount)+1,LEN(amount)) AS float)
                       ) as Amount
      FROM #myData
    )
    , DataSequenced as (
     SELECT a.ID, a.DebtorID, a.Amount, ROW_NUMBER() OVER (PARTITION BY a.DebtorID, a.Amount ORDER BY a.ID) as RN
     FROM FormattedData a
    )
    , MatchesRemoved as (
      SELECT a.ID, a.DebtorID, A.Amount
      FROM DataSequenced a
      WHERE NOT EXISTS (SELECT lkp.ID 
                        FROM DataSequenced lkp 
                        WHERE lkp.DebtorID = a.DebtorID
                          AND lkp.Amount = -a.Amount
                          AND lkp.RN = a.RN)
    )
    
    SELECT myData.ID, DateInit, DebtorName, Trancode, a.Amount, myData.DebtorID
      , myData.ClientRef
     FROM MatchesRemoved a
     INNER JOIN #myData myData ON myData.ID = a.ID
    ORDER BY ID
    

    See it in this fiddle: https://dbfiddle.uk/XY8h2uvG

    Because we are not specifically applying an aggregate on the Amount column, it doesn't need to be numeric, we can actually use simple string comparisons. You will have to profile this in your environment, but the following query is similar:

    WITH DataSequenced as (
     SELECT a.ID, a.DebtorID, a.Amount
      , ROW_NUMBER() OVER (PARTITION BY a.DebtorID, a.Amount ORDER BY a.ID) as RN
      , CASE
            WHEN LEFT(a.Amount,1) = '-' THEN RTRIM(RIGHT(a.Amount, LEN(a.Amount) - 1)) 
            ELSE RTRIM(a.Amount)
        END as AbsAmount
     FROM #myData a
    )
    , MatchesRemoved as (
      SELECT a.ID, a.DebtorID, A.Amount
      FROM DataSequenced a
      WHERE NOT EXISTS (SELECT lkp.ID 
                        FROM DataSequenced lkp 
                        WHERE lkp.DebtorID = a.DebtorID
                          AND lkp.AbsAmount = a.AbsAmount
                          AND lkp.ID <> a.ID
                          AND lkp.RN = a.RN)
    )
    SELECT myData.ID, DateInit, DebtorName, Trancode, a.Amount, myData.DebtorID
      , myData.ClientRef
     FROM MatchesRemoved a
     INNER JOIN #myData myData ON myData.ID = a.ID
    ORDER BY ID
    

    Depending on the complexity of your datasource and the indexes that might be available (or NOT available) It might make more sense to process this in multiple queries against temporary tables.


    The following are previous attempts to address the original request. I choose to leave these in due to the ambiguity of the OP title, the though process and solutions (including the fiddles) might prove helpful to others who find themselves on this page.

    Find the transactions where the SUM of the Debtor is ZERO : This was interpreted as return all transactions for Debtors who do not have a ZERO balance.

    https://dbfiddle.uk/l-64hpZF

    WITH FormattedData as (
      SELECT ID, DebtorID, COALESCE(TRY_PARSE(amount as float)
                           , (-1)*CAST(SUBSTRING(amount,CHARINDEX('-',amount)+1,LEN(amount)) AS float)
                       ) as Amount
      FROM #myData
    )
    , DataWithSum as (
     SELECT ID, DebtorID, Amount,  SUM(Amount) OVER( Partition By DebtorID) AS SumAmount
     FROM FormattedData
    )
    SELECT myData.ID, myData.DebtorID, D.Amount, D.SumAmount
      , myData.ClientRef
     FROM DataWithSum D
     -- join back if you need additional columns
     INNER JOIN #myData myData ON myData.ID = D.ID
     WHERE SumAmount > 0
    
    ID DebtorID Amount SumAmount ClientRef
    1 23916762 -53.42 137.51 T081819513107-00009
    3 23916764 -12.08 12.08 T081819513107-00010
    4 23916764 12.08 12.08 T081819513107-00010
    6 23916762 53.42 137.51 T081819513107-00009
    7 23916762 66.41 137.51 T081819513107-00009
    8 23916762 71.1 137.51 T081819513107-00009
    9 23916764 12.08 12.08 T081819513107-00010
    • I hope you can appreciate how much easier this sort of logic is if you store numeric values in a numeric typed column.

    You could also have joined the result of your GROUP BY back onto the original set, if your HAVING clause was going to remove a LOT of records, then this can be more efficient than the Window Function


    Update: Running Balance - Only show the most recent transactions since a zero balance

    If we need to balance out the transactions, then a simple solution is that we can compute the running balance for all previous rows with the same DebtorID. In this fiddle you can see the working out: https://dbfiddle.uk/DVRoMoNd

    Update #2: Filtering by >0 is not enough

    I turns out that even though the previous fiddle returned the correct answer, it wouldn't work if the transaction history had previously gone into negatives and then back to zero. The following query finds the last time that the running balance was less than or equal to zero and returns all the transactions after that: https://dbfiddle.uk/exwm9OwA

    -- if we only want the transactions since the last time the account was in balance
    WITH FormattedData as (
      SELECT ID, DebtorID, COALESCE(TRY_PARSE(amount as float)
                           , (-1)*CAST(SUBSTRING(amount,CHARINDEX('-',amount)+1,LEN(amount)) AS float)
                       ) as Amount
      FROM #myData
    )
    , DataWithRunningBalance as (
     SELECT a.ID, a.DebtorID, a.Amount, (
                                         SELECT SUM(Amount) 
                                         FROM FormattedData lkp
                                         WHERE lkp.DebtorID = a.DebtorID
                                         AND lkp.ID <= a.ID
                                        ) AS SumAmount
     FROM FormattedData a
    )
    SELECT myData.ID, myData.DebtorID, A.Amount
      , myData.ClientRef
     FROM FormattedData A
     INNER JOIN #myData myData ON myData.ID = A.ID
     WHERE A.ID > (
                   SELECT TOP 1 ID 
                   FROM DataWithRunningBalance B 
                   WHERE B.DebtorID = myData.DebtorID 
                     AND SumAmount <= 0 
                   ORDER BY ID DESC
                  )
    ORDER BY ID
    
    ID DebtorID Amount ClientRef
    7 23916762 66.41 T081819513107-00009
    8 23916762 71.1 T081819513107-00009
    9 23916764 12.08 T081819513107-00010

    The sequence of the data is important to the SumAmount column, in this case I couldn't use the timestamp because there was not enough granularity in the values, ID was a natural fit for this dataset. In your production code there might be a a more suitable time based candidate.

    The order of the negative vs positive values does not matter, SumAmount is a simple running balance with respect to time of entry.