Search code examples
arraysstructazure-databricks

How to select records in a delta table column in Azure Databricks with Map of Struct datatype?


I have a delta table in Databricks named prod.silver.platform_ctl_ingestion. It has a few columns including table_name with string data type and transform_options with below structure:

 |-- transform_options: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- col_name_mappings: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- type_mappings: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- partition_duplicates_by: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- order_duplicates_by: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)

For example, when table_name is prod.silver.weather, the transform_options is:

{
"prod.bronze.weather_source_a":{"col_name_mappings":{"col_a_old":"col_a_new","col_b_old":"col_b_new"},"type_mappings":{"col_a_new":"INT","col_b_new":"TIMESTAMP"},"partition_duplicates_by":["col_a_new"],"order_duplicates_by":["_commit_version"]},
"prod.bronze.weather_source_b":{"col_name_mappings":{"col_c_old":"col_c_new","col_d_old":"col_d_new"},"type_mappings":{"col_c_new":"INT","col_d_new":"TIMESTAMP"},"partition_duplicates_by":["col_c_new"],"order_duplicates_by":["ingestion_timestamp","_commit_version"]}
}

I want to select rows where order_duplicates_by in transform_options contains _commit_version.

I tried the below query but it returns all rows.

SELECT 
    *,
    transform_values(transform_options, (k, v) -> 
    CASE
    WHEN array_contains(v.order_duplicates_by, '_commit_version') THEN
    STRUCT(v.col_name_mappings, v.type_mappings, v.partition_duplicates_by, v.order_duplicates_by)
    ELSE NULL
    END
    ) AS updated_transform_options
FROM prod.silver.platform_ctl_ingestion;

Any idea how to do that?


Solution

  • If you are filtering, then it will be easy to remove _commit_version in the array.

    Command:

    %sql
    SELECT 
        *,
        transform_values(transform_options, (k, v) -> 
        STRUCT(v.col_name_mappings, v.type_mappings, v.partition_duplicates_by, array_append(array_remove(v.order_duplicates_by, '_commit_version'), 'commit_version') as order_duplicates_by)
        ) AS updated_transform_options
    FROM control_table
    WHERE array_contains(get(map_values(transform_options), 0)['order_duplicates_by'], '_commit_version');
    

    No need to check _commit_version in transform_values.

    Enter image description here