I am trying to use DELETE
to remove duplicate records from my BigQuery table.
I found multiple solutions to the above question but most of them use CREATE
, REPLACE
or SELECT
.
The closest solution using DELETE
I found was:
BigQuery - DELETE statement to remove duplicates
BigQuery Standard SQL: Delete Duplicates from Table
I have a follow up questions on the below solution:
#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
SELECT AS STRUCT id, MAX(loadTime) loadTime
FROM `yourproject.yourdataset.duplicates` where id= '123'
GROUP BY id)
This statement deletes all the records from the table which does not satisfy the NOT IN
condition. For example, if my table looks like below:
Id Loadtime
123 5
123 4
456 2
321 1
The query above deletes all records except for the first row. How can I modify the query so that it deletes only the 2nd row i.e. it only deletes group by id?
The final output should be:
Id Loadtime
123 5
456 2
321 1
Below should work as per your expectation
#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
SELECT AS STRUCT id, MAX(loadTime) loadTime
FROM `yourproject.yourdataset.duplicates`
GROUP BY id)
so, in your sample - it will delete ONLY second row
Id Loadtime
123 4