Search code examples
google-bigqueryjinja2dbtdbt-bigquery

Looping through struct column to get conditional outcome as new columns


Say I have a table in BQ called rating with a struct column called rating_record. Schema as:

[
  {
    "name": "id",
    "mode": "NULLABLE",
    "type": "STRING",
    "description": null,
    "fields": []
  },
  {
    "name": "rating_record",
    "mode": "NULLABLE",
    "type": "RECORD",
    "description": null,
    "fields": [
      {
        "name": "high_drop",
        "type": "BOOLEAN",
        "fields": []
      },
      {
        "name": "medium_bump",
        "type": "BOOLEAN",
        "fields": []
      }
]

rating_record contains fields high_drop and medium_bump and there could be many fields with suffix _drop and _bump with true or false values. I want to iterate this record type (struct) field using a dbt macro to create two new columns against an id called drop_reasons & bump_reasons - drop_reasons in this case would be = 'high_drop' if the value is true.

I tried to iterate the record with a sql_statement and using dbt_utils.get_query_results_as_dict to get the outcome but unable to create columns accordingly.

{% set sql_statement %}
    select id, rating_record from {{ ref('source_table' }}
{% endset %}

{%- set ids_and_ratings = dbt_utils.get_query_results_as_dict(sql_statement) -%}

select

    {% for id in ids_and_ratings['id'] | unique -%}
        {% set bump_reasons = [] %}
        {% set drop_reasons = [] %}
        {% for rating_record in ids_and_ratings['rating_record'] | unique -%}
            {% for key, value in fromjson(rating_record).items() -%}
                {% if key.endswith('bump') and value is sameas true %}
                    {{ bump_reasons.append(key) }}
                {% elif key.endswith('drop') and value is sameas true %}
                    {{ drop_reasons.append(key) }}
                {% endif %}
            {% endfor %}
        {% endfor %}
        {{ print(drop_reasons) }}
    {% endfor %}

from {{ ref('source_table' }}

Solution

  • I suppose values in the fields of the table schema is a single value rather than an array. get_query_results_as_dict returns a key of the column and the value to be a list of the column value. In the jinja code the first loop iterate all the ids, for each id the second loop iterate all rating_record struct for all ids but not of the id in the first loop.
    From the logic of your jinja code, it can be done using BigQuery. The dummy data is

    insert into database.table_name values
        ("01", STRUCT(True, False, True, True)),
        ("02", STRUCT(False, False, True, False)),
        ("03", STRUCT(False, True, True, False))
    

    Added a macro to get keys in the rating_record struct

    {% macro get_struct_fields() %}
    {% set query %}
    SELECT split(field_path, ".")[OFFSET(1)] as fields
    FROM `project_id`.`region-us`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS 
    WHERE table_name = "struct_tbl"
    and column_name = "rating_record"
    and data_type = "BOOL" 
    {% endset %}
    {% set results = run_query(query) %}
    {% if execute %}
        {% set fields = results.columns[0].values() %}
    {% else %}
        {% set fields = [] %}
    {% endif %}
    {%do log(fields, info=true) %}
    {{ return(fields)}}
    {% endmacro %} 
    

    The model is

    {% set fields = get_struct_fields() %}
    {% set fields = fields | join(", ") %}
    select *
        from
        (select id, split(bump_drop, "_")[offset(1)] as bump_drop, bump_drop as 
        reason
        from
        (select id, rating_record.* from `project_id.database.table_name`)
        unpivot (tf for bump_drop in ({{ fields }}))
        where tf is true
        )
        pivot (array_agg(reason ignore nulls) as reasons for bump_drop in ("bump", 
        "drop"))
    

    where

    select * from (
    select id, rating_record.* from `project_id.database.table_name`)
    unpivot (tf for bump_drop in (high_bump, high_drop, medium_bump, medium_drop))
    

    will result in a table with columns id, tf which contains true/false, and bump_drop with values of something_bump/something_drop.

    Then select only true part and separate out the 'bump' and 'drop'. The unpivot will put reasons for bump and drop separately into an array for each id.