sql nested duplicates google-bigquery standards

Remove duplicates from nested tables in Google BigQuery

We have been starting to work with Google bigQuery nested tables. we are having difficulties to identify duplicates on this table. we tried using Rownumber() function but it means that we need to dive into every record and record. we are talking about a table with more than 10 records in the basic lair of architecture and more than 400 fields

the following code is what we created so far, it gives recognition on duplicates but with as many fields we can write, as I wrote, there are over 400 fields

SELECT
    count (*) AS Number_Of_Records --this will let us know how many records there is
FROM 
    (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY field1, ... , fieldN) AS ranking 
    FROM
        `data.T1`,
        unnest(record1) as record1, --unnesting is a must with nested tables and records
        ....,
        unnest(recordN) as recordN
    )
WHERE
    ranking=1 --duplicats >1

we would like to find a shorter way to write this query to find the number of nonduplicates rows, thanks

Solution

One approach is to use hash function to identify a record and use this hash to remove duplicates.

So Assuming your JSON data is composed of Strings you can use the following to generate a unique hash code for your records:

WITH items AS
  (SELECT ["apples", "bananas", "pears", "grapes"] as list
  UNION ALL
  SELECT ["coffee", "tea", "milk" ] as list
  UNION ALL
  SELECT ["cake", "pie", NULL] as list -- Duplicate record
  UNION ALL
  SELECT ["cake", "pie", NULL] as list)

SELECT distinct FARM_FINGERPRINT(ARRAY_TO_STRING(list, '--')) AS text
FROM items;

This results in 3 rows and not 4 as expected (1 rows in duplicate)