Search code examples
azure-data-explorerkql

How to summarize data with arg_max() in KQL using two columns?


I have a table with following columns:

  • ID: identifies an imported document (think filename). This is unique for a combination of ImportId and ImportTime.
  • SomeData: some data column. In the real table there are more columns.
  • ImportId: an ID in the format of YYYY-MM-DD, eg "2022-05-14" (this is a string column)
  • ImportTime: the date and time the import was done (this is a string column)

The RowNum is NOT part of the table but used here to be able to reference records/rows.

RowNum ID Value ImportId ImportTime
1 A Doc A content as of May 11, 2022 2022-05-11 2022-05-11 13:00
2 B Doc B content as of May 11, 2022 2022-05-11 2022-05-11 13:00
3 A Doc A content as of May 11, 2022 2022-05-11 2022-05-11 17:00
4 B Doc B content as of May 11, 2022 2022-05-11 2022-05-11 17:00
5 A Doc A content as of May 14, 2022 2022-05-14 2022-05-17 08:00
6 B Doc B content as of May 14, 2022 2022-05-14 2022-05-17 08:00
7 A Doc A content as of May 14, 2022 2022-05-14 2022-05-17 10:00
8 B Doc B content as of May 14, 2022 2022-05-14 2022-05-17 10:00
9 A Doc A content as of May 11, 2022 2022-05-11 2022-05-18 15:00
10 B Doc B content as of May 11, 2022 2022-05-11 2022-05-18 15:00
  • In the table above there were three imports for May 11 (ImportId = "2022-05-11") and two imports for data from May 14 (ImportId = "2022-05-14").
  • The latest import run (ImportTime) was at 2022-05-18 15:00
  • The latest ImportTime does not necessarily correlate with the latest import data. In my example above, someone ran an import on May 18 at 15:00 but imported the state of the catalog as it was on May 11 (ImportId = "2022-05-11").

Challenge: I need to get the records with the newest ImportId (which would be "2022-05-14") and the latest ImportTime (which would be "2022-05-18 15:00").

For the example above, the result should contain the two rows with ImportId "2022-05-14" and ImportTime "2022-05-17 10:00" (row numbers 7 and 8).

What I tried:

Approach 1

I used arg_max() on ImportTime:

T
| summarize arg_max(ImportTime, *) by ID

This returns the last two rows (9 and 10), where ImportId is "2022-05-11". That's not what I'm after because the newest ImportId is "2022-05-14".

Approach 2

If I use arg_max(ImportId, *) by ID instead, I am getting the ones for "2022-05-14" (rows 5 and 6), but not the ones with the latest ImportTime.

Approach 3

I combined ImportTime and ImportId into an extended column and applied arg_max() on that. This seems to work but I'm unsure if it's correct in all cases?

T
| extend Combined = strcat(ImportId, ImportTime)
| summarize arg_max(Combined, *) by ID

This returns the expected rows 7 and 8 for "2022-05-14" at the import time of "2022-05-17 10:00".

Are there better options?


Solution

  • Check out top-nested operator:

    datatable(Value:string, ImportId:datetime, ImportTime:datetime)
    [
        "A",    datetime(2022-05-11),   datetime(2022-05-11 13:00),
        "B",    datetime(2022-05-11),   datetime(2022-05-11 13:00),
        "A",    datetime(2022-05-11),   datetime(2022-05-11 17:00),
        "B",    datetime(2022-05-11),   datetime(2022-05-11 17:00),
        "A",    datetime(2022-05-14),   datetime(2022-05-17 08:00),
        "B",    datetime(2022-05-14),   datetime(2022-05-17 08:00),
        "A",    datetime(2022-05-14),   datetime(2022-05-17 10:00),
        "B",    datetime(2022-05-14),   datetime(2022-05-17 10:00),
        "A",    datetime(2022-05-11),   datetime(2022-05-18 15:00),
        "B",    datetime(2022-05-11),   datetime(2022-05-18 15:00)
    ]
    | top-nested of Value by ignore=max(1),
      top-nested 1 of ImportId by max(ImportId),
      top-nested 1 of ImportTime by max(ImportTime)
    | project Value, ImportId, ImportTime
    
    Value ImportId ImportTime
    A 2022-05-14 00:00:00.0000000 2022-05-17 10:00:00.0000000
    B 2022-05-14 00:00:00.0000000 2022-05-17 10:00:00.0000000