Search code examples
sqlgoogle-bigqueryrow-number

Finding Duplicate Records Only Using ROW_NUMBER() in Google Bigquery


I want to find "name_org1" records that are duplicated - no unique records. I am operating in Google Bigquery.

To do this I decided to use the ROW_NUMBER() windows function to rank all records, with duplicates being ranked in sequential order. Everything seems to work, except I cannot filter out unique records. In other words, I want my data set to only include ranks that are sequential (1,2, or 1,2,3, or 1,2,3,4, etc).

I do not want any one-to-ones (ie, unique records on each row). Can't seem to figure out how to exclude these. Tried using ChatGPT but I'm not satisfied.

Here's my query thus far, which yields the duplicates I'm looking for, but also includes the unique records:

SELECT
partner, name_org1, 
  ROW_NUMBER() OVER(PARTITION BY name_org1) as duplicates
FROM `gcp-gfs-datalake-core-prd.sap_p05__saphanadb__views_current.but000`

I feel I've exhausted most of the options I know how to use, including using CTEs or aggregates, but I'm at a loss right now.

Appreciate the help.


Solution

  • Looks like you need COUNT(*) window function

    SELECT
      b.partner,
      b.name_org1
    FROM (
        SELECT *,
          COUNT(*) OVER (PARTITION BY name_org1) as count
        FROM `gcp-gfs-datalake-core-prd.sap_p05__saphanadb__views_current.but000` b
    ) b
    WHERE b.count > 1;