Search code examples
distinctgoogle-bigquery

Delete duplicate rows from a BigQuery table


I have a table with >1M rows of data and 20+ columns.

Within my table (tableX) I have identified duplicate records (~80k) in one particular column (troubleColumn).

If possible I would like to retain the original table name and remove the duplicate records from my problematic column otherwise I could create a new table (tableXfinal) with the same schema but without the duplicates.

I am not proficient in SQL or any other programming language so please excuse my ignorance.

delete from Accidents.CleanedFilledCombined 
where Fixed_Accident_Index 
in(select Fixed_Accident_Index from Accidents.CleanedFilledCombined 
group by Fixed_Accident_Index 
having count(Fixed_Accident_Index) >1);

Solution

  • You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).

    A query that should work is here:

    SELECT *
    FROM (
      SELECT
          *,
          ROW_NUMBER()
              OVER (PARTITION BY Fixed_Accident_Index)
              row_number
      FROM Accidents.CleanedFilledCombined
    )
    WHERE row_number = 1