sql sql-server group-by sum sql-server-2017

Select only rows which the SUM of specific column is distinct to zero with group by another colum SQL Server

I want to filter a table eliminating the rows that give zero in a sum over an "amount" column and that are grouped by an Id, for example, if I have "-10, 10 and 15" for the same Id. I want to eliminate the rows that have -10 and 10 and stay with row 15

So far I was able to filter the table using GROUP BY but I need to get the ID to do a select again on the same table, also I need the SUM split.

CREATE Table #myData 
(
    Id INT, 
    DateInit date, 
    DebtorName varchar(10), 
    Trancode varchar(10), 
    Amount VARCHAR(50), 
    DebtorID varchar(50), 
    ClientRef NVARCHAR(256)
)

INSERT INTO #myData 
VALUES
 (1, '1/1/2022', 'BURNETTE', '13 P', '-53.42', '23916762', 'T081819513107-00009') 
,(2, '1/1/2022', 'BURNETTE', '13 P', '-28.83', '23916761', 'T081819513107-00008') 
,(3, '1/1/2022', 'BURNETTE', '13 P', '-12.08', '23916764', 'T081819513107-00010') 
,(4, '1/1/2022', 'BURNETTE', '1 P', '12.08', '23916764', 'T081819513107-00010') 
,(5, '1/1/2022', 'BURNETTE', '1 P', '28.83', '23916761', 'T081819513107-00008') 
,(6, '1/1/2022', 'BURNETTE', '1 P', '53.42', '23916762', 'T081819513107-00009') 
,(7, '1/1/2022', 'BURNETTE', '2 P', '66.41', '23916762', 'T081819513107-00009') 
,(8, '1/1/2022', 'BURNETTE', '2 P', '71.10', '23916762', 'T081819513107-00009') 
,(9, '1/1/2022', 'BURNETTE', '2 P', '12.08', '23916764', 'T081819513107-00010') 

SELECT 
    DebtorID, 
    SUM(CASE 
            WHEN TRY_PARSE(A.Amount AS float) IS NULL
                THEN (-1) * CAST(SUBSTRING(A.Amount, CHARINDEX('-', A.amount) + 1, LEN(A.Amount)) AS float)
                ELSE CAST(A.Amount AS float)
        END) AS amountSum
FROM 
    #myData A
GROUP BY 
    DebtorID
HAVING 
    SUM(CASE 
            WHEN TRY_PARSE(A.Amount AS float) IS NULL
                THEN (-1) * CAST(SUBSTRING(A.Amount, CHARINDEX('-', A.Amount) + 1, LEN(A.Amount)) AS float)
                ELSE CAST(A.Amount AS float)
        END) != 0

But this query returns:

DebtorID	amountSum
23916762	137.51
23916764	12.08

The ultimate goal is to filter these results and save them in another table identical to the original (#myData).

This is what I want to achieve for input #myData :

Id	DateInit	DebtorName	Trancode	amount	DebtorID	ClientRef
6	1/1/2022	BURNETTE	2 P	66.41	23916762	T081819513107-00009
8	1/1/2022	BURNETTE	2 p	71.10	23916762	T081819513107-00009
8	1/1/2022	BURNETTE	2 p	12.08	23916764	T081819513107-00010

I would appreciate any help, thanks!

Update 2 - new scenario

thanks for the help, but this scenario has to have the same output.

Id	DateInit	DebtorName	Trancode	amount	DebtorID	ClientRef
1	1/1/2022	BURNETTE	13 P	-53.42	23916762	T081819513107-00009
2	1/1/2022	BURNETTE	13 P	-12.08	23916764	T081819513107-00010
3	1/1/2022	BURNETTE	1 p	12.08	23916764	T081819513107-00010
4	1/1/2022	BURNETTE	1 p	28.83	23916761	T081819513107-00008
5	1/1/2022	BURNETTE	1 p	66.41	23916762	T081819513107-00009
6	1/1/2022	BURNETTE	13 p	-28.83	23916761	T081819513107-00008
7	1/1/2022	BURNETTE	1 p	53.42	23916762	T081819513107-00009
8	1/1/2022	BURNETTE	1 p	71.10	23916762	T081819513107-00009
9	1/1/2022	BURNETTE	2 p	12.08	23916764	T081819513107-00010

the expected result output:

Id	DateInit	DebtorName	Trancode	amount	DebtorID	ClientRef
5	1/1/2022	BURNETTE	1 p	66.41	23916762	T081819513107-00009
8	1/1/2022	BURNETTE	1 p	71.10	23916762	T081819513107-00009
9	1/1/2022	BURNETTE	2 p	12.08	23916764	T081819513107-00010

Solution

If you need the original data set but want to include an aggregate or want to filter or sort based on an aggregate then Window Functions can help you out a lot.

UPDATE: The nature of the question has changed slightly, however the original advice is still the same, we don't want to use GROUP BY because we lose the detail rows, if we must do this in a sing expression then Window Functions provide a simple syntax solution

In this solution I've used a CTE to separate out the formatting of the amount column and the calculation of the aggregate, but you could have used CROSS APPLY for the formatting.

Notice in the solution we are NOT applying a SUM at all, we are merely matching values, this means the same solution with work for non--numeric values

In the fiddle I've tried to detail some of the steps but ultimately, we process the data in the following sequence:

Format the Amount column into a numeric value

This is actually a redundant but simplifies other aggregate operations you might make later

Identify the unique pairs of Amount and DebtorID values using ROW_NUMBER() window function
Filter out the unique pairs that match
Return the processed data joined back with any additional columns or references you might need.

WITH FormattedData as (
  SELECT ID, DebtorID, COALESCE(TRY_PARSE(amount as float)
                       , (-1)*CAST(SUBSTRING(amount,CHARINDEX('-',amount)+1,LEN(amount)) AS float)
                   ) as Amount
  FROM #myData
)
, DataSequenced as (
 SELECT a.ID, a.DebtorID, a.Amount, ROW_NUMBER() OVER (PARTITION BY a.DebtorID, a.Amount ORDER BY a.ID) as RN
 FROM FormattedData a
)
, MatchesRemoved as (
  SELECT a.ID, a.DebtorID, A.Amount
  FROM DataSequenced a
  WHERE NOT EXISTS (SELECT lkp.ID 
                    FROM DataSequenced lkp 
                    WHERE lkp.DebtorID = a.DebtorID
                      AND lkp.Amount = -a.Amount
                      AND lkp.RN = a.RN)
)

SELECT myData.ID, DateInit, DebtorName, Trancode, a.Amount, myData.DebtorID
  , myData.ClientRef
 FROM MatchesRemoved a
 INNER JOIN #myData myData ON myData.ID = a.ID
ORDER BY ID

See it in this fiddle: https://dbfiddle.uk/XY8h2uvG

Because we are not specifically applying an aggregate on the Amount column, it doesn't need to be numeric, we can actually use simple string comparisons. You will have to profile this in your environment, but the following query is similar:

WITH DataSequenced as (
 SELECT a.ID, a.DebtorID, a.Amount
  , ROW_NUMBER() OVER (PARTITION BY a.DebtorID, a.Amount ORDER BY a.ID) as RN
  , CASE
        WHEN LEFT(a.Amount,1) = '-' THEN RTRIM(RIGHT(a.Amount, LEN(a.Amount) - 1)) 
        ELSE RTRIM(a.Amount)
    END as AbsAmount
 FROM #myData a
)
, MatchesRemoved as (
  SELECT a.ID, a.DebtorID, A.Amount
  FROM DataSequenced a
  WHERE NOT EXISTS (SELECT lkp.ID 
                    FROM DataSequenced lkp 
                    WHERE lkp.DebtorID = a.DebtorID
                      AND lkp.AbsAmount = a.AbsAmount
                      AND lkp.ID <> a.ID
                      AND lkp.RN = a.RN)
)
SELECT myData.ID, DateInit, DebtorName, Trancode, a.Amount, myData.DebtorID
  , myData.ClientRef
 FROM MatchesRemoved a
 INNER JOIN #myData myData ON myData.ID = a.ID
ORDER BY ID

Depending on the complexity of your datasource and the indexes that might be available (or NOT available) It might make more sense to process this in multiple queries against temporary tables.

The following are previous attempts to address the original request. I choose to leave these in due to the ambiguity of the OP title, the though process and solutions (including the fiddles) might prove helpful to others who find themselves on this page.

Find the transactions where the SUM of the Debtor is ZERO : This was interpreted as return all transactions for Debtors who do not have a ZERO balance.

https://dbfiddle.uk/l-64hpZF

WITH FormattedData as (
  SELECT ID, DebtorID, COALESCE(TRY_PARSE(amount as float)
                       , (-1)*CAST(SUBSTRING(amount,CHARINDEX('-',amount)+1,LEN(amount)) AS float)
                   ) as Amount
  FROM #myData
)
, DataWithSum as (
 SELECT ID, DebtorID, Amount,  SUM(Amount) OVER( Partition By DebtorID) AS SumAmount
 FROM FormattedData
)
SELECT myData.ID, myData.DebtorID, D.Amount, D.SumAmount
  , myData.ClientRef
 FROM DataWithSum D
 -- join back if you need additional columns
 INNER JOIN #myData myData ON myData.ID = D.ID
 WHERE SumAmount > 0

ID	DebtorID	Amount	SumAmount	ClientRef
1	23916762	-53.42	137.51	T081819513107-00009
3	23916764	-12.08	12.08	T081819513107-00010
4	23916764	12.08	12.08	T081819513107-00010
6	23916762	53.42	137.51	T081819513107-00009
7	23916762	66.41	137.51	T081819513107-00009
8	23916762	71.1	137.51	T081819513107-00009
9	23916764	12.08	12.08	T081819513107-00010

I hope you can appreciate how much easier this sort of logic is if you store numeric values in a numeric typed column.

You could also have joined the result of your GROUP BY back onto the original set, if your HAVING clause was going to remove a LOT of records, then this can be more efficient than the Window Function

Update: Running Balance - Only show the most recent transactions since a zero balance

If we need to balance out the transactions, then a simple solution is that we can compute the running balance for all previous rows with the same DebtorID. In this fiddle you can see the working out: https://dbfiddle.uk/DVRoMoNd

Update #2: Filtering by >0 is not enough

I turns out that even though the previous fiddle returned the correct answer, it wouldn't work if the transaction history had previously gone into negatives and then back to zero. The following query finds the last time that the running balance was less than or equal to zero and returns all the transactions after that: https://dbfiddle.uk/exwm9OwA

-- if we only want the transactions since the last time the account was in balance
WITH FormattedData as (
  SELECT ID, DebtorID, COALESCE(TRY_PARSE(amount as float)
                       , (-1)*CAST(SUBSTRING(amount,CHARINDEX('-',amount)+1,LEN(amount)) AS float)
                   ) as Amount
  FROM #myData
)
, DataWithRunningBalance as (
 SELECT a.ID, a.DebtorID, a.Amount, (
                                     SELECT SUM(Amount) 
                                     FROM FormattedData lkp
                                     WHERE lkp.DebtorID = a.DebtorID
                                     AND lkp.ID <= a.ID
                                    ) AS SumAmount
 FROM FormattedData a
)
SELECT myData.ID, myData.DebtorID, A.Amount
  , myData.ClientRef
 FROM FormattedData A
 INNER JOIN #myData myData ON myData.ID = A.ID
 WHERE A.ID > (
               SELECT TOP 1 ID 
               FROM DataWithRunningBalance B 
               WHERE B.DebtorID = myData.DebtorID 
                 AND SumAmount <= 0 
               ORDER BY ID DESC
              )
ORDER BY ID

ID	DebtorID	Amount	ClientRef
7	23916762	66.41	T081819513107-00009
8	23916762	71.1	T081819513107-00009
9	23916764	12.08	T081819513107-00010

The sequence of the data is important to the SumAmount column, in this case I couldn't use the timestamp because there was not enough granularity in the values, ID was a natural fit for this dataset. In your production code there might be a a more suitable time based candidate.

The order of the negative vs positive values does not matter, SumAmount is a simple running balance with respect to time of entry.