Search code examples
sqlgoogle-bigquerybigquery-udf

I want to de-dupe records in BigQuery with max column value on specific column with expression


company  | email | phone | website | address
Amar CO LLC | [email protected] | 123 | NULL | India
Amar CO | [email protected] | NULL | NULL | IND
Stacks CO | [email protected] | 910 | stacks.com | United Kingdom
Stacks CO LLC | [email protected] | NULL | NULL | UK

I want to drop the company name with CO LLC instead want to keep Amar CO but want all the columns from Amar CO LLC as it has minimum NULL values or maximum column data.

In short: De-dupe the records, remove the company name with 'ending with or matching with LLC' (case insensitive), but keep the values from both of the record which has maximum Information column.

Expected output

Amar CO | [email protected] | 123 | NULL | India
Stacks CO | [email protected] | 910 | stacks.com | United Kingdom

Solution

  • to give precedence to the record having minimum null values ...

    Below is for BigQuery Standard SQL (query#1)

    #standardSQL
    select 
      array_agg(t 
        order by array_length(regexp_extract_all(to_json_string(t), ':null')) 
        limit 1
      )[offset(0)].* 
      replace(regexp_replace(company, r'(?i)CO LLC', 'CO') as company) 
    from `project.dataset.table` t
    group by company 
    

    if applied to sample data from your question - output is

    enter image description here

    In case if you want to fill all fields from all the records - you can use below (query#2)

    select regexp_replace(company, r'(?i)CO LLC', 'CO') as company,
      max(email) email,
      max(phone) phone,
      max(website) website,
      max(address) address
    from `project.dataset.table`
    group by company 
    

    and finally - if you still want to give precedence to the record having minimum null values, but the rest of nulls replace with values from other rows - use below (query#3)

    select company, 
      ifnull(email, max_email) email,
      ifnull(phone, max_phone) phone,
      ifnull(website, max_website) website,
      ifnull(address, max_address) address
    from (
      select array_agg(t 
          order by array_length(regexp_extract_all(to_json_string(t), ':null')) 
          limit 1
        )[offset(0)].* 
        replace(regexp_replace(company, r'(?i)CO LLC', 'CO') as company),
        max(email) max_email, 
        max(phone) max_phone,
        max(website) max_website,
        max(address) max_address
      from `project.dataset.table` t
      group by company 
    )  
    

    you can test/check the difference between this and previous option by applying them to below dummy data

    with `project.dataset.table` as (
      select 'Amar CO LLC' company, '[email protected]' email, 123 phone, NULL website, 'India' address union all
      select 'Amar CO', NULL, 222, 'amar.com', NULL union all
      select 'Stacks CO LLC', '[email protected]', NULL, NULL, 'UK' union all
      select 'Stacks CO', '[email protected]', 910, 'stacks.com', 'United Kingdom'
    )
    

    the last query (query#3) gives

    enter image description here

    while previous (query#2) will just give max across all rows

    enter image description here