Search code examples
sqlnestedduplicatesgoogle-bigquerystandards

Remove duplicates from nested tables in Google BigQuery


We have been starting to work with Google bigQuery nested tables. we are having difficulties to identify duplicates on this table. we tried using Rownumber() function but it means that we need to dive into every record and record. we are talking about a table with more than 10 records in the basic lair of architecture and more than 400 fields

the following code is what we created so far, it gives recognition on duplicates but with as many fields we can write, as I wrote, there are over 400 fields

SELECT
    count (*) AS Number_Of_Records --this will let us know how many records there is
FROM 
    (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY field1, ... , fieldN) AS ranking 
    FROM
        `data.T1`,
        unnest(record1) as record1, --unnesting is a must with nested tables and records
        ....,
        unnest(recordN) as recordN
    )
WHERE
    ranking=1 --duplicats >1

we would like to find a shorter way to write this query to find the number of nonduplicates rows, thanks


Solution

  • One approach is to use hash function to identify a record and use this hash to remove duplicates.

    So Assuming your JSON data is composed of Strings you can use the following to generate a unique hash code for your records:

    WITH items AS
      (SELECT ["apples", "bananas", "pears", "grapes"] as list
      UNION ALL
      SELECT ["coffee", "tea", "milk" ] as list
      UNION ALL
      SELECT ["cake", "pie", NULL] as list -- Duplicate record
      UNION ALL
      SELECT ["cake", "pie", NULL] as list)
    
    SELECT distinct FARM_FINGERPRINT(ARRAY_TO_STRING(list, '--')) AS text
    FROM items;
    

    This results in 3 rows and not 4 as expected (1 rows in duplicate)

    enter image description here