Search code examples
sqlsql-serverwindow-functions

SQL - How to do Window function if there is NULL value?


First, I have this information:

  1. Weight A
  2. Weight B
  3. Relationship B to A: 1 to Many

Such that, below result can be obtained:

A_Id Weight A Weight B B_Id
1 3 16 1
2 5 16 1
3 6 16 1
4 7 16 1
5 2 12 2
6 6 12 2

Now, adding two more columns: Sum Weight A By B_Id, Accumulative Difference (consider below as table t2)

A_Id Weight A Sum Weight A By B_Id Weight B B_Id Accumulative Diff
1 3 21 16 1 5
2 5 21 16 1 5
3 6 21 16 1 5
4 7 21 16 1 5
5 2 8 12 2 1
6 6 8 12 2 1

For example above,

  1. first row accumulative difference => 21 - 16 = 5

  2. the fifth row accumulative difference => (21 + 8) - (16 + 12) = 1

So, my objective, is to compute such 'Accumulative Difference' The entire result is to be displayed in the report.

Technically, by using 'Window Functions', this can be achieved without problem. First, I have to create 2 more columns: Accumulate Weight A By B_Id, Accumulate Weight B. Then, just find the difference between the two.

I actually need 3 more columns:

  • [Row By B_Id]
  • [Sum Weight A By B_Id]
  • [Accumulate Weight B]
A_Id Weight A Sum Weight A By B_Id Weight B B_Id Row By B_Id Accumulate Weight A By B_Id Accumulate Weight B Accumulative Diff
1 3 21 16 1 1 21 16 5
2 5 21 16 1 2 21 16 5
3 6 21 16 1 3 21 16 5
4 7 21 16 1 4 21 16 5
5 2 8 12 2 1 29 28 1
6 6 8 12 2 2 29 28 1

sample SQL (to generatet2):

SELECT *, [Accumulate Weight A By B_Id] = SUM(WeightA) OVER (PARTITION BY ... ORDER BY B_Id), [Accumulate Weight B] = SUM(WeightB) OVER (PARTITION BY ... ORDER BY B_Id) FROM t2 -- (...) could be by date year month -- Accumulate Weight B can set to only 1st row, etc
;WITH tableA AS (
SELECT [A_Id] = 1, [Weight] = 3, [B_Id] = 1, [date] = '2021-10-01'
UNION
SELECT [A_Id] = 2, [Weight] = 5, [B_Id] = 1, [date] = '2021-10-02'
UNION
SELECT [A_Id] = 3, [Weight] = 6, [B_Id] = 1, [date] = '2021-10-03'
UNION
SELECT [A_Id] = 4, [Weight] = 7, [B_Id] = 1, [date] = '2021-10-04'
UNION
SELECT [A_Id] = 5, [Weight] = 2, [B_Id] = 2, [date] = '2021-10-05'
UNION
SELECT [A_Id] = 6, [Weight] = 6, [B_Id] = 2, [date] = '2021-10-06'
    
--Uncomment for testing NULL value
--UNION
--SELECT [A_Id] = 7, [Weight] = 9, [B_Id] = NULL, [date] = '2021-10-07'
--UNION
--SELECT [A_Id] = 8, [Weight] = 10, [B_Id] = 3, [date] = '2021-10-08'
    
),
tableB AS (
     SELECT [B_Id] = 1, [Weight] = 16, [date] = '2021-10-03'
     UNION
     SELECT [B_Id] = 2, [Weight] = 12, [date] = '2021-10-06'

    --Uncomment for testing NULL value
    --UNION
    --SELECT [B_Id] = 3, [Weight] = 8, [date] = '2021-10-08'
),
t1a AS (
    SELECT 
        [A_Id] = tableA.A_Id,
        [WeightA] = tableA.Weight,
        [WeightB] = tableB.Weight,
        [B_Id] = tableB.B_Id,
        [Row By B_Id] = ROW_NUMBER() OVER(PARTITION BY tableB.B_Id ORDER BY A_Id)
    FROM 
        tableA 
    FULL JOIN tableB ON tableA.B_Id = tableB.B_Id
),
t1b AS (
    SELECT
        *,
        [Sum Weight A By B_Id] = SUM(WeightA) OVER (ORDER BY B_Id),
        [Accumulate Weight B] = SUM(CASE WHEN [Row By B_Id] = 1 THEN WeightB ELSE 0 END) OVER (ORDER BY B_Id)
    FROM t1a
),
t2 AS (
    SELECT 
        *,
        [Accumulate Difference] = [Sum Weight A By B_Id] - [Accumulate Weight B]
    FROM t1b
)
SELECT 
    *
FROM t2

Now, the problem comes, IF one of the B_Id is NULL. (Uncomment the parts to generate NULL B_Id)

Below is my expected result, especially on the highlighted row:

A_Id Weight A Sum Weight A By B_Id Weight B B_Id Accumulate Weight A By B_Id Accumulate Weight B Accumulative Diff
1 3 21 16 1 21 16 5
2 5 21 16 1 21 16 5
3 6 21 16 1 21 16 5
4 7 21 16 1 21 16 5
5 2 8 12 2 29 28 1
6 6 8 12 2 29 28 1
7 9 9 0 NULL 38 28 10
8 7 10 8 3 48 36 12
9 3 10 8 3 48 36 12

However, with my sample query, this is not working. Instead, below appears:

current result

The NULL B_Id appears on the first row instead. (The order is messed up)

So my question, how to handle such situation? (Keep the original row as in the expected result)

As why the order is in such as way? (raised by @ThorstenKettner)

The order by default, is based B_TransactionDatetime. If B_Id is NULL, then it will be based on A_TransactionDatetime. So, I compute another column RefDateTime = COALESCE(B_TransactionDatetime, A_TransactionDatetime), and order based on that.

PS:

Inspired by @ThorstenKettner, i should use the RefDateTime in the window function, i.e:

[Sum Weight A By B_Id] = SUM(WeightA) OVER (ORDER BY RefDateTime),
[Accumulate Weight B] = SUM(CASE WHEN [Row By B_Id] = 1 THEN WeightB ELSE 0 END) OVER (ORDER BY RefDateTime)

Case closed.


Solution

  • You want to outer join B to A, because not every A has an associated B.

    Then you look at the rows blockwise. One block is either all rows.belonging to one B or a single A row that doesn't have a B. The b_id would make a good group key for the former, while the a_id would be appropriate for the latter. For the combined key there are different options. COALESCE(b_id, a_id) isn't one of them, because we could have an a_id 1 and a b_id 1 in the result set, but don't want them in the same group. One solution is a simple COALESCE(b_id, -a_id), provided your IDs cannot be negative, of course.

    Now, all your calculations are based on the aggregated groups, I.e. you are not interested in single A values when they belong to a B group. For this reason I'd aggregate immediately and only join the single A rows again at the very end.

    The order of the rows is COALESCE(b_date, a_date).

     with grouped as
        (
          select
            coalesce(b.b_id, -a.a_id) as grp_id,
            max(coalesce(b.date, a.date)) as grp_date,
            coalesce(max(b.weight), 0) as b_weight,
            sum(a.weight) as a_weight
          from a
          left join b on b.b_id =a.b_id
          group by coalesce(b.b_id, -a.a_id)
        )
        , calculated as
        (
          select
            grp_id,
            grp_date,
            b_weight,
            a_weight,
            sum(a_weight - b_weight) over (order by grp_date) as running_diff
            from grouped
        )
        select *
        from calculated c
        join a on a.b_id = c.grp_id or a.a_id = -c.grp_id
        order by c.grp_date, a.date;
    

    I hope I've got everything right. I don't have a computer at hand and am typing this on my mobile, which turns out harder than I thought :-)