Search code examples
functiongoogle-bigqueryduplicatesanalytics

Most efficient way to delete duplicates from GBQ table by multiple columns


enter image description here

I have the following table (see above).

What is the most efficient way to delete duplicates by appId AND clientId so the resulting table contains only those rows with minimum DIFF

Ex: 260848 should be yandex as source

Thanks!


Solution

  • Below is for BigQuery Standard SQL

    #standardSQL
    SELECT AS VALUE ARRAY_AGG(t ORDER BY DIFF LIMIT 1)[OFFSET(0)]
    FROM `project.dataset.table` t
    GROUP BY appId, clientId