company | email | phone | website | address
Amar CO LLC | [email protected] | 123 | NULL | India
Amar CO | [email protected] | NULL | NULL | IND
Stacks CO | [email protected] | 910 | stacks.com | United Kingdom
Stacks CO LLC | [email protected] | NULL | NULL | UK
I want to drop the company name with CO LLC
instead want to keep Amar CO
but want all the columns from Amar CO LLC
as it has minimum NULL
values or maximum column data.
In short: De-dupe the records, remove the company name with 'ending with or matching with LLC' (case insensitive), but keep the values from both of the record which has maximum Information column.
Expected output
Amar CO | [email protected] | 123 | NULL | India
Stacks CO | [email protected] | 910 | stacks.com | United Kingdom
to give precedence to the record having minimum null values ...
Below is for BigQuery Standard SQL (query#1)
#standardSQL
select
array_agg(t
order by array_length(regexp_extract_all(to_json_string(t), ':null'))
limit 1
)[offset(0)].*
replace(regexp_replace(company, r'(?i)CO LLC', 'CO') as company)
from `project.dataset.table` t
group by company
if applied to sample data from your question - output is
In case if you want to fill all fields from all the records - you can use below (query#2)
select regexp_replace(company, r'(?i)CO LLC', 'CO') as company,
max(email) email,
max(phone) phone,
max(website) website,
max(address) address
from `project.dataset.table`
group by company
and finally - if you still want to give precedence to the record having minimum null values, but the rest of nulls replace with values from other rows - use below (query#3)
select company,
ifnull(email, max_email) email,
ifnull(phone, max_phone) phone,
ifnull(website, max_website) website,
ifnull(address, max_address) address
from (
select array_agg(t
order by array_length(regexp_extract_all(to_json_string(t), ':null'))
limit 1
)[offset(0)].*
replace(regexp_replace(company, r'(?i)CO LLC', 'CO') as company),
max(email) max_email,
max(phone) max_phone,
max(website) max_website,
max(address) max_address
from `project.dataset.table` t
group by company
)
you can test/check the difference between this and previous option by applying them to below dummy data
with `project.dataset.table` as (
select 'Amar CO LLC' company, '[email protected]' email, 123 phone, NULL website, 'India' address union all
select 'Amar CO', NULL, 222, 'amar.com', NULL union all
select 'Stacks CO LLC', '[email protected]', NULL, NULL, 'UK' union all
select 'Stacks CO', '[email protected]', 910, 'stacks.com', 'United Kingdom'
)
the last query (query#3) gives
while previous (query#2) will just give max across all rows