
How to delete duplicate rows in Kusto?


I have a table in Kusto that contains some duplicates. I can identify them because each unique record has a unique Key column. I figured a simple way to delete the duplicates would be a command like this:

.delete table TABLENAME records <| TABLENAME
| summarize MaxIngestionTime = max(ingestion_time()) by Key
| join kind=inner (TABLENAME) on Key
| where ingestion_time() != MaxIngestionTime

However, this fails because .delete doesn't allow summarize in the predicate. I tried getting around this by doing the summarize in a separate query and materializing it, like so:

let maxingest = materialize(TABLENAME
| summarize MaxIngestionTime = max(ingestion_time()) by Key);

.delete table TABLENAME records <| TABLENAME
| join kind=inner (maxingest) on Key
| where ingestion_time() != MaxIngestionTime

But this doesn't work either: Kusto says it can't resolve table or column expression "maxingest". Is there any other way to accomplish this?


Solution

  • This is a possible solution. It requires sorting the table by Key and then by ingestion_time(), and assigning each record an index from 1 to n, where n is the number of duplicates for that key. The challenge is that you can't introduce new columns in the predicate and are limited to a specific set of functions you can call.

    Here is a sample table creation script where you ingest duplicated keys into the table:

    .create-merge table SampleTest (
        Key: int,
        Content: string
    );
    
    .append SampleTest <|
    datatable(Key: int, Content: string) [
        1, "a",
        2, "b"
    ]
    
    .append SampleTest <|
    datatable(Key: int, Content: string) [
        1, "c",
        2, "d"
    ]
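
    Before deleting anything, you can sanity-check which keys are duplicated with a quick count per Key (this verification query is my own addition, not part of the delete itself):

    SampleTest
    | summarize RecordCount = count() by Key
    | where RecordCount > 1

    With the sample data above, both keys 1 and 2 show a count of 2.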
    

    Here is how you delete the duplicated records, keeping the latest ones only:

    .delete table SampleTest records <|
    SampleTest
    | sort by Key, ingestion_time() desc
    | where row_cumsum(1, prev(Key) != Key) > 1
    

    Here is what is happening:

    • First, you serialize the records by sorting the rows by the unique Key and then by ingestion_time() in descending order, so the latest record for each key comes first
    • Next, you use the where clause with the row_cumsum function to compute a running index. The index increments by 1 for every record and restarts at 1 whenever the Key changes, so the latest record for each Key gets an index of 1 and is kept, while every record with an index greater than 1 is deleted
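
    To preview exactly which rows the delete will target, you can run the same logic as a plain query and expose the index with extend (this works here because a regular query, unlike a .delete predicate, is allowed to add columns; DuplicateIndex is just an illustrative name):

    SampleTest
    | sort by Key, ingestion_time() desc
    | extend DuplicateIndex = row_cumsum(1, prev(Key) != Key)
    | where DuplicateIndex > 1

    The rows this query returns are the ones the .delete command above removes.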