We have started working with Google BigQuery nested tables and are having difficulty identifying duplicates in one of them. We tried the ROW_NUMBER() function, but that means listing every field and record explicitly, and we are talking about a table with more than 10 records in the base layer of the architecture and more than 400 fields.
The following code is what we have created so far. It does identify duplicates, but only with as many fields as we can write out, and, as mentioned, there are over 400 of them.
SELECT
  COUNT(*) AS number_of_records -- this lets us know how many records there are
FROM
(
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY field1, ..., fieldN) AS ranking
  FROM
    `data.T1`,
    UNNEST(record1) AS record1, -- unnesting is a must with nested tables and records
    ...,
    UNNEST(recordN) AS recordN
)
WHERE
  ranking = 1 -- duplicates have ranking > 1
We would like to find a shorter way to write this query that returns the number of non-duplicate rows. Thanks.
One approach is to use a hash function to identify each record, and then use that hash to remove duplicates.
So, assuming your JSON data is composed of strings, you can use the following to generate a hash code for each of your records:
WITH items AS (
  SELECT ["apples", "bananas", "pears", "grapes"] AS list
  UNION ALL
  SELECT ["coffee", "tea", "milk"] AS list
  UNION ALL
  SELECT ["cake", "pie", NULL] AS list -- duplicate record
  UNION ALL
  SELECT ["cake", "pie", NULL] AS list
)
SELECT DISTINCT FARM_FINGERPRINT(ARRAY_TO_STRING(list, '--')) AS fingerprint
FROM items;
This returns 3 rows instead of 4, because one of the rows is a duplicate.
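To apply the same idea to the table from your question without listing all 400+ fields, one option is to serialize each whole row before hashing. Here is a minimal sketch, assuming the table is the `data.T1` from your query and that TO_JSON_STRING is an acceptable way to serialize a row together with its nested records:
SELECT
  COUNT(DISTINCT FARM_FINGERPRINT(TO_JSON_STRING(t))) AS number_of_distinct_records
FROM
  `data.T1` AS t; -- t refers to the whole row as a STRUCT, nested records included
Since TO_JSON_STRING serializes nested and repeated fields as part of the row, the UNNEST joins from the original query should not be needed just for counting distinct rows; the fingerprint effectively replaces the 400-field PARTITION BY list.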