Search code examples
sortingduplicateskqlazure-data-explorersummarize

KQL: deduplication, record priority and sorting


I'm new to Kusto and trying to figure out how to de-dupe in KQL. I have a table in Kusto that stores account information. For some of the accounts, the table has 2 rows: one with updated_c == true and another row with updated_c == false. I would like to write a KQL script that would survive the row with updated_c == true in these cases of duplicate accounts. In other words, I'm trying to de-dupe this dataset where when there are 2 rows per each account, it would survive the one with updated_c == true. And if there are only one row per perspective account, it would survive it regardless of updated_c vlaues.

account Id updated_c
12 true
12 false

would anyone be able to help me learn how to de-dupe this?


Solution

  • This should do it.

    datatable
    | extend updated_c_int = toint(updated_c)
    | summarize arg_max(updated_c_int, *) by accountId
    | project accountId, updated_c
    
    • extend updated_c_int = toint(updated_c): Convert the boolean updated_c to an integer. true becomes 1 and false becomes 0.
    • summarize arg_max(updated_c_int, *) by accountId: For each accountId, get the row with the highest value of updated_c_int. If there's a tie, arg_max will return the first row it encounters.
    • project accountId, updated_c: Only select the columns accountId and updated_c in the final output.

    Given your example input, this query should return:

    accountId    updated_c
    12           true