sql google-cloud-platform google-bigquery data-science data-warehouse

How to compare two tables having record type column in BigQuery

I have two nested tables one is source and another is target table. I wanted to compare nested columns of source and target table. I am comparing two tables to check weather data is being updated in source table or not. Is there any sql in BigQuery to achieve the same?

Here are my approaches which I was previously doing to compare two tables with nested record:

1. This is the first approach:

SELECT to_json_string(info) FROM database.nested_table_source
except distinct
SELECT to_json_string(info) FROM nested_table_target

to_json_string() is not working, since this function returns different sequence of source rows and target rows sometimes, even though data is same in both tables, it results out different records.

2. This is the second approach:

select name
from dataset.nested_table_source a
join dataset.nested_table_target b
using(name)
where
a.name!=b.name  and
(select string_agg(format('%t', s) order by key) from a.info s) 
!= (select string_agg(format('%t', s) order by key ) from b.info s)

In this approach I am using string_agg function to compare two nested records. But I am not sure if that is the correct way to compare record fields.

What should I do in this case?

Solution

Here is an approach in which you basically stringify an ordered set of objects (or the info column in your tables), and then compare them against each other.

Here is an example with some dummy data:

with source_data as (
    select 
    "VICTOR" as name,
     array[
        struct("A" as key, 3 as value),       
        struct("B" as key, 5 as value)   
    ] as info

    union all 
    
    select 
    "MAX" as name,
     array[
        struct("A" as key, 0 as value),       
        struct("B" as key, 1 as value)   
    ] as info

    union all 
    
    select 
    "SAIF" as name,
     array[
        struct("A" as key, 0 as value),       
        struct("B" as key, 1 as value)   
    ] as info
),
target_data as (
    select 
    "VICTOR" as name,
     array[
        struct("A" as key, 3 as value),       
        struct("B" as key, 15 as value)   
    ] as info

    union all 
    
    select 
    "MAX" as name,
     array[
        struct("A" as key, 0 as value),       
        struct("B" as key, 1 as value)   
    ] as info
)

select name, stringified_source_set as info from (
    select 
        s.name, 
        array_to_string(array(select concat(cast(x.key as string), '|', cast(x.value as string)) from unnest(t.info) as x order by x.key), '|') AS stringified_target_set,
        array_to_string(array(select concat(cast(x.key as string), '|', cast(x.value as string)) from unnest(s.info) as x order by x.key), '|') AS stringified_source_set
    from source_data s
    left join target_data t on t.name = s.name 
)
where (stringified_source_set != stringified_target_set) or (name is null)

Note that the above approach does achieve both "horizontal comparison" (i.e. comparing both info objects) and "vertical comparison" (i.e. comparing missing entries in the target table which exist in the source table).