Search code examples
sqlgoogle-cloud-platformgoogle-bigquerydata-sciencedata-warehouse

How to compare two tables having record type column in BigQuery


I have two nested tables one is source and another is target table. I wanted to compare nested columns of source and target table. I am comparing two tables to check weather data is being updated in source table or not. Is there any sql in BigQuery to achieve the same?

enter image description here

Here are my approaches which I was previously doing to compare two tables with nested record:

1. This is the first approach:

SELECT to_json_string(info) FROM database.nested_table_source
except distinct
SELECT to_json_string(info) FROM nested_table_target

to_json_string() is not working, since this function returns different sequence of source rows and target rows sometimes, even though data is same in both tables, it results out different records.

2. This is the second approach:

select name
from dataset.nested_table_source a
join dataset.nested_table_target b
using(name)
where
a.name!=b.name  and
(select string_agg(format('%t', s) order by key) from a.info s) 
!= (select string_agg(format('%t', s) order by key ) from b.info s)

In this approach I am using string_agg function to compare two nested records. But I am not sure if that is the correct way to compare record fields.

What should I do in this case?


Solution

  • Here is an approach in which you basically stringify an ordered set of objects (or the info column in your tables), and then compare them against each other.

    Here is an example with some dummy data:

    with source_data as (
        select 
        "VICTOR" as name,
         array[
            struct("A" as key, 3 as value),       
            struct("B" as key, 5 as value)   
        ] as info
    
        union all 
        
        select 
        "MAX" as name,
         array[
            struct("A" as key, 0 as value),       
            struct("B" as key, 1 as value)   
        ] as info
    
        union all 
        
        select 
        "SAIF" as name,
         array[
            struct("A" as key, 0 as value),       
            struct("B" as key, 1 as value)   
        ] as info
    ),
    target_data as (
        select 
        "VICTOR" as name,
         array[
            struct("A" as key, 3 as value),       
            struct("B" as key, 15 as value)   
        ] as info
    
        union all 
        
        select 
        "MAX" as name,
         array[
            struct("A" as key, 0 as value),       
            struct("B" as key, 1 as value)   
        ] as info
    )
    
    select name, stringified_source_set as info from (
        select 
            s.name, 
            array_to_string(array(select concat(cast(x.key as string), '|', cast(x.value as string)) from unnest(t.info) as x order by x.key), '|') AS stringified_target_set,
            array_to_string(array(select concat(cast(x.key as string), '|', cast(x.value as string)) from unnest(s.info) as x order by x.key), '|') AS stringified_source_set
        from source_data s
        left join target_data t on t.name = s.name 
    )
    where (stringified_source_set != stringified_target_set) or (name is null)
    
    

    Note that the above approach does achieve both "horizontal comparison" (i.e. comparing both info objects) and "vertical comparison" (i.e. comparing missing entries in the target table which exist in the source table).