Search code examples
google-bigqueryduplicatesdelete-row

BigQuery - DELETE statement to remove duplicates Group by Id


I am trying to use DELETE to remove duplicate records from my BigQuery table.

I found multiple solutions to the above question but most of them use CREATE, REPLACE or SELECT.

The closest solution using DELETE I found was:

BigQuery - DELETE statement to remove duplicates

BigQuery Standard SQL: Delete Duplicates from Table

I have a follow up questions on the below solution:

#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
    SELECT AS STRUCT id, MAX(loadTime) loadTime 
    FROM `yourproject.yourdataset.duplicates` where id= '123'
    GROUP BY id)

This statement deletes all the records from the table which does not satisfy the NOT IN condition. For example, if my table looks like below:

Id      Loadtime
123        5
123        4
456        2
321        1

The query above deletes all records except for the first row. How can I modify the query so that it deletes only the 2nd row i.e. it only deletes group by id?

The final output should be:

 Id      Loadtime
 123        5
 456        2
 321        1

Solution

  • Below should work as per your expectation

    #standardSQL
    DELETE FROM `yourproject.yourdataset.duplicates`
    WHERE STRUCT(id, loadTime) NOT IN (
    SELECT AS STRUCT id, MAX(loadTime) loadTime 
    FROM `yourproject.yourdataset.duplicates` 
    GROUP BY id)  
    

    so, in your sample - it will delete ONLY second row

    Id  Loadtime     
    123 4