I have a table with many duplicated rows - but I only want to deduplicate rows one partition at a time.
How can I do this?
As an example, you can start with a table partitioned by date and filled with random integers from 1 to 5:
CREATE OR REPLACE TABLE `temp.many_random`
PARTITION BY d
AS
SELECT DATE('2018-10-01') d, fhoffa.x.random_int(0,5) random_int
FROM UNNEST(GENERATE_ARRAY(1, 100))
UNION ALL
SELECT CURRENT_DATE() d, fhoffa.x.random_int(0,5) random_int
FROM UNNEST(GENERATE_ARRAY(1, 100))
Let's see what data we have in the existing table:
SELECT d, random_int, COUNT(*) c
FROM `temp.many_random`
GROUP BY 1, 2
ORDER BY 1,2
That's a lot of duplicates!
We can de-duplicate one single partition using MERGE
and SELECT DISTINCT *
with a query like this:
MERGE `temp.many_random` t
USING (
SELECT DISTINCT *
FROM `temp.many_random`
WHERE d=CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND d=CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
Then the end result looks like this:
We need to make sure to have the same date in the SELECT
and the row with THEN DELETE
. This will delete all rows on that partition, and insert all rows from the SELECT DISTINCT
.
Inspired by:
To de-duplicate a whole table, see: