Search code examples
structgoogle-bigqueryaggregation

BigQuery Struct Aggregation


I am processing an ETL job on BigQuery, where I am trying to reconcile data where there may be conflicting sources. I frist used array_agg(distinct my_column ignore nulls) to find out where reconciliation was needed and next I need to prioritize data per column base on the source source.

I thought to array_agg(struct(data_source, my_column)) and hoped I could easily extract the preferred source data for a given column. However, with this method, I failed aggregating data as a struct and instead aggregated data as an array of struct.

Considered the simplified example below, where I will prefer to get job_title from HR and dietary_pref from Canteen:

with data_set as (
    select 'John' as employee, 'Senior Manager' as job_title, 'vegan' as dietary_pref, 'HR' as source
    union all
    select 'John' as employee, 'Manager' as job_title, 'vegetarian' as dietary_pref, 'Canteen' as source
    union all
    select 'Mary' as employee, 'Marketing Director' as job_title, 'pescatarian' as dietary_pref, 'HR' as source
    union all
    select 'Mary' as employee, 'Marketing Manager' as job_title, 'gluten-free' as dietary_pref, 'Canteen' as source

)

select employee,
       array_agg(struct(source, job_title)) as job_title,
       array_agg(struct(source, dietary_pref)) as dietary_pref,
from data_set
group by employee

The data I get for John with regard to the job title is: [{'source':'HR', 'job_title':'Senior Manager'}, {'source': 'Canteen', 'job_title':'Manager'}] Whereas I am trying to achieve: [{'HR' : 'Senior Manager', 'Canteen' : 'Manager'}]

With a struct output, I was hoping to then easily access the preferred source using my_struct.my_preferred_source. I this particular case I hope to invoke job_title.HR and dietary_pref.Canteen.

Hence in pseudo-SQL here I imagine I would :

select employee,
        AGGREGATE_JOB_TITLE_AS_STRUCT(source, job_title).HR  as job_title,
        AGGREGATE_DIETARY_PREF_AS_STRUCT(source, dietary_pref).Canteen as dietary_pref, 
from data_set group by employee

The output would then be:

expected_output

I'd like help here solving this. Perhaps that's the wrong approach altogether, but given the more complex data set I am dealing with I thought this would be the preferred approach (albeit failed).

Open to alternatives. Please advise. Thanks

Notes: I edited this post after Mikhail's answer, which solved my problem using a slightly different method than I expected, and added more details on my intent to use a single struct per employee


Solution

  • Consider below

    select employee,
      array_agg(struct(source as job_source, job_title) order by if(source = 'HR', 1, 2) limit 1)[offset(0)].*,
      array_agg(struct(source as dietary_source, dietary_pref) order by if(source = 'HR', 2, 1) limit 1)[offset(0)].*
    from data_set
    group by employee                  
    

    if applied to sample data in your question - output is

    enter image description here

    Update:

    use below for clarified output

    select employee,
      array_agg(job_title order by if(source = 'HR', 1, 2) limit 1)[offset(0)] as job_title,
      array_agg(dietary_pref order by if(source = 'HR', 2, 1) limit 1)[offset(0)] as dietary_pref
    from data_set
    group by employee
    

    with output

    enter image description here