Search code examples
google-bigquerybigquery-udf

Duplicates in Bigquery partitioned Table


I have seen a issue with the data ,i am trying to find the duplicated data which got inserted on different day partition, now that the problem i want retain the data for older date and delete the latest ingested data.

Ex :- enter image description here

In the screenshot as you see data is same in two rows but due to some issues same data got inserted into 2021-04-28 as well, Now i want to delete all such records in 2021-04-28 partition. And retain the 04-27 data and actual rows of 04-28.

Could someone please share the query to find these duplicates and query to get rid of the duplicates that are in 2021-04-28 partition.

Below is the query i have written to identify the duplicates but looks it is not perfectly correct.

select grp_nbr,port_rgs_id,tranc_number,strt_tm,sqr_nbr,itm_nbr,trvl_dte from
(select grp_nbr,port_rgs_id,tranc_number,strt_tm,sqr_nbr,itm_nbr,trvl_sdte, row_number() over(partition by grp_nbr,port_rgs_id,tranc_number,strt_tm,sqr_nbr,itm_nbr,trvl_dte order by trvl_dte desc)rn from  table_name)a where rn=1

Solution

  • Your query is correct. In a more readable way it can be:

    SELECT * EXCEPT(rownum)
    FROM (
        SELECT 
            *,
            ROW_NUMBER() OVER (PARTITION BY grp_nbr,port_rgs_id,tranc_number,strt_tm,sqr_nbr,itm_nbr ORDER BY trvl_dte ) rownum
        FROM `yourtable`
        )
    WHERE rownum = 1
    

    EDIT: the trvl_dte column should not be included in the PARTITION BY statement. Also since you want to keep the earliest trvl_dte, you need to ORDER BY trvl_dte ASC and not DESC.