Search code examples
google-bigquerydml

Google BiqQuery: How delete dublicate rows from table without any extra colum


I have table with two columns, where there are duplicates:

select record_id,
       session_id,
       count(record_id) as total_duplicate_records
from `project.dataset.table` 
group by 1,2
order by 3 desc

The output looks like: enter image description here

I saw a similar question: BigQuery Standard SQL: Delete Duplicates from Table but an answer based on some extra column to be used to serve as a tie-breaker. In my case there is no such column...

The goal is to run a scheduled query that will delete all duplicate records once a day


Solution

  • You can use SELECT DISTINCT to deduplicate your table.

    CREATE OR REPLACE TABLE
        `project.dataset.table` -- this will overwrite your table
    AS
    
    SELECT DISTINCT
        record_id,
        session_id
    FROM
        `project.dataset.table`
    

    Create a new table

    Replace the table to be replaced or created with a new name. This will create a new table. Running the query again will recreate/overwrite the new table.

    CREATE OR REPLACE TABLE
        `project.dataset.table_deduped` -- change to a new table
    AS
    
    SELECT DISTINCT
        record_id,
        session_id
    FROM
        `project.dataset.table`
    

    Some potential drawbacks:

    • you are storing your data twice. (The duplicated and deduplicated tables)
    • You have to recreate the table each time you get new data in your original table, however you can instead of recreating it, insert new data via a merge statement

    Create a VIEW

    A view is a virtual table based on a query. The data is not stored in a new table. You query it the same way you query a table. Can be useful if the underlying table or tables gets updated and you always want an up to date 'view' of the data.

    CREATE VIEW
        `project.dataset.table_deduped` -- change to a new table
    AS
    
    SELECT DISTINCT
        record_id,
        session_id
    FROM
        `project.dataset.table`
    

    Use a cte table (WITH clause)

    Create a temporary table that you can reference inside the same query.

    WITH 
        deduped_table AS (
            SELECT DISTINCT
                record_id,
                session_id
            FROM
                `project.dataset.table`
    )
    
    SELECT 
        *
    FROM 
        deduped_table
    
    

    All the options have their pro's and cons. Depending on what you want to achieve, how your data is structured and how much you need to care about the amount of data your are querying/storing. If you only have the two columns it's probably not that big of a deal but worth noting your options in any case I rate.