Search code examples
c#differenceu-sqlwindowing

Calculating change in column over groups and extracting based on criteria


I am a beginner to coding in U-SQL/C#. I am stuck in a place during windowing/aggregation.

My Data looks like

Name          Date                  OrderNo    Type   Balance
one 2018-06-25T04:55:44.0020987Z      1        Drink     15       
one 2018-06-25T04:57:44.0020987Z      1        Drink     70      
one 2018-06-25T04:59:44.0020987Z      1        Drink     33       
one 2018-06-25T04:59:49.0020987Z      1        Drink     25       
two 2018-06-25T04:55:44.0020987Z      2        Drink     22       
two 2018-06-25T04:57:44.0020987Z      2        Drink     81       
two 2018-06-25T04:58:44.0020987Z      2        Drink     33       
two 2018-06-25T04:59:44.0020987Z      2        Drink     45       

In U-SQL I am adding a unique id based on combinations of name, orderno and type and for the purpose of sorting, I am adding another one including the date.

@files = 
EXTRACT
        name string,
        date DateTime,
        type string,
        orderno int,
        balance int           
FROM
@InputFile
USING new JsonExtractor();

@files2 =
    SELECT *,
       DENSE_RANK() OVER(ORDER BY name,type,orderno,date) AS group_id,
       DENSE_RANK() OVER(ORDER BY name,type,orderno) AS id
     FROM @files;

My Data now looks like this:

Name          Date                  OrderNo    Type   Balance group_id id
one 2018-06-25T04:55:44.0020987Z      1        Drink     15       1     1
one 2018-06-25T04:57:44.0020987Z      1        Drink     70       2     1
one 2018-06-25T04:59:44.0020987Z      1        Drink     33       3     1
one 2018-06-25T04:59:49.0020987Z      1        Drink     25       4     1
two 2018-06-25T04:55:44.0020987Z      2        Drink     22       5     2
two 2018-06-25T04:57:44.0020987Z      2        Drink     81       6     2
two 2018-06-25T04:58:44.0020987Z      2        Drink     33       7     2
two 2018-06-25T04:59:44.0020987Z      2        Drink     45       8     2

(I have added only 4 records per group but there are multiple per group)

I am stuck at determining the difference between successive rows in the balance column in each group.

Expected Output for Part 1:

Name          Date                  OrderNo    Type   Balance group_id id  increase
one 2018-06-25T04:55:44.0020987Z      1        Drink     15       1     1    0
one 2018-06-25T04:57:44.0020987Z      1        Drink     70       2     1    55
one 2018-06-25T04:59:44.0020987Z      1        Drink     33       3     1   -37
one 2018-06-25T04:59:49.0020987Z      1        Drink     25       4     1   -8
two 2018-06-25T04:55:44.0020987Z      2        Drink     22       5     2    0
two 2018-06-25T04:57:44.0020987Z      2        Drink     81       6     2    59
two 2018-06-25T04:58:44.0020987Z      2        Drink     33       7     2   -48
two 2018-06-25T04:59:44.0020987Z      2        Drink     45       8     2    8

For every new group (defined by id) the increase should start from zero.

I went through stack overflow and saw the lag function from transgresql. I could not find a C# equivalent. Is that applicable in this case?

Any help is appreciated. Further clarification will be provided if required.

Update: When I use CASE WHEN my solution looks like this

CURRENT OUTPUT                            DESIRED OUTPUT
id Balance Increase                     id  Balance Increase
 1  15      0                            1  15      0
 1  70     55                            1  70     55
 1  33    -37                            1  33    -37
 1  25     -8                            1  25     -8
 2  22    "-3"                           2  22     "0"
 2  81     59                            2  81     59
 2  33    -48                            2  33    -48
 2  45     12                            2  45     12

Look at the highlighted row. The increase column must start at 0 for each id.

Update: I was able to solve the first part of my question. See my answer below. The second part that I had posted earlier was incorrectly posted. I have removed that.


Solution

  • The query that finally worked for me was this..

    @files = 
    EXTRACT
            name string,
            date DateTime,
            type string,
            orderno int,
            balance int           
    FROM
    @InputFile
    USING new JsonExtractor();
    
    @files2 =
    SELECT *,
           DENSE_RANK() OVER(ORDER BY name,type,orderno) AS group_id
    FROM @files;
    
    @files3 =
    SELECT *,
           DENSE_RANK() OVER(PARTITION BY group_id ORDER BY date) AS group_order
    FROM @files2;
    
    
    @files4 =
    SELECT *,
         (CASE WHEN group_order == 1 THEN 0 
             ELSE  balance  - LAG(balance) OVER(ORDER BY name,type,orderno) 
        END) AS increase 
    FROM @files3;