Search code examples
kqlazure-data-explorerkusto-explorer

Grouping on the basis of cumulative sum


Let's say I have the following Kusto dataset (the field category is unique as I already called out in the comment)

// assumptions:- category is unique
let threshold=100;
datatable(category:string,measure:int)
[
 'cat1',10,
 'cat2',20,
 'cat3',30,
 'cat4',40,
 'cat5',45,
 'cat6',45,
 'cat7',49,
 'cat8',50,
 'cat9',50
]
| order by measure asc

I would like to output sets of category where cumulative sum is less than or equal to the threshold , this can be calculated in ascending order of measure. So in case of this example I would expect the following output:-

group1= {cat1,cat2,cat3,cat4} its cumulative sum is 100
group2= {cat5,cat6} its cumulative sum is 90
group3= {cat7,cat8} its cumulative sum is 99
group4= {cat9} its cumulative sum is 50

I was trying to explore following kind of logic but it's not supported in Kusto:-

// assumptions:- category is unique
let threshold=100;
datatable(category:string,measure:int)
[
 'cat1',10,
 'cat2',20,
 'cat3',30,
 'cat4',40,
 'cat5',45,
 'cat6',45,
 'cat7',49,
 'cat8',50,
 'cat9',50
]
| order by measure asc
| serialize cumsum=row_cumsum(measure,prev(cumsum)+measure>threshold)

Basically here I am trying to access cumsum of previous row and restart cumsum if previous cumsum when added to the measure of the current row exceeds the threshold. But obviously it will not work , as I can't access cumsum of previous row. How do I achieve this?


Solution

  • You can achieve this using scan operator:

    let threshold=100;
    datatable(category:string,measure:int)
    [
     'cat1',10,
     'cat2',20,
     'cat3',30,
     'cat4',40,
     'cat5',45,
     'cat6',45,
     'cat7',49,
     'cat8',50,
     'cat9',50
    ]
    | order by measure asc
    | scan declare (cumulative_x:long=0, group_num:long=1) with 
    (
        step s1: true => 
                 cumulative_x = iff(s1.cumulative_x + measure <= threshold, measure + s1.cumulative_x, measure),
                 group_num =    iff(s1.cumulative_x + measure <= threshold, s1.group_num, s1.group_num + 1);
    )
    | summarize make_list(category), max(cumulative_x) by group_num
    
    group_num list_category max_cumulative_x
    1 [
    "cat1",
    "cat2",
    "cat3",
    "cat4"
    ]
    100
    2 [
    "cat5",
    "cat6"
    ]
    90
    3 [
    "cat7",
    "cat8"
    ]
    99
    4 [
    "cat9"
    ]
    50