azure, google-bigquery, azure-data-factory, azure-databricks

Azure Data Factory - Google BigQuery Copy Data activity not returning nested column names


I have a Copy Data activity in Azure Data Factory with a Google BigQuery source. I need to import the whole table, which contains nested fields (RECORD columns in BigQuery).

Nested fields get imported as follows (a string containing only data values):

"{\"v\":{\"f\":[{\"v\":\"1\"},{\"v\":\"1\"},{\"v\":\"1\"},{\"v\":null},{\"v\":\"1\"},{\"v\":null},{\"v\":null},{\"v\":\"1\"},{\"v\":null},{\"v\":null},{\"v\":null},{\"v\":null},{\"v\":\"0\"}]}}"

Expected output would be something like:

{"nestedColName" : [{"subNestedColName": 1}, {"subNestedColName": 1}, {"subNestedColName": 1}, {"subNestedColName": null}, ...] }

I think this is an issue with Data Factory's connector, but I'm not sure how to proceed. I've considered using Databricks to import the data from BigQuery directly and then writing the DataFrame to the sink. I've also considered querying only a subset of columns and using UNNEST where required (sketched below), but I'd rather avoid that since Parquet handles both Array and Map types.
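A sketch of the UNNEST alternative, shown here with the google-cloud-bigquery Python client just to illustrate the query shape; the project, dataset, table, and column names (including id) are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()  # uses application default credentials

    # Flatten one nested (REPEATED RECORD) column into rows instead of
    # importing the raw nested value.
    sql = """
        SELECT t.id, sub.subNestedColName
        FROM `my-project.my_dataset.my_table` AS t
        LEFT JOIN UNNEST(t.nestedColName) AS sub
    """

    for row in client.query(sql).result():
        print(row.id, row.subNestedColName)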

Anyone encountered this before / what did you do?


Solution

  • Solution used:

    Databricks (Spark) connector for Google BigQuery:

    https://docs.databricks.com/data/data-sources/google/bigquery.html

    This preserves the schema and nested field names; a minimal PySpark sketch of the approach follows.
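
    A minimal PySpark sketch of this approach, assuming the BigQuery connector is installed on the Databricks cluster, Google credentials are already configured, and using placeholder project/dataset/table names and sink path:

        # Read the BigQuery table through the Spark BigQuery connector.
        # Nested RECORD columns arrive as StructType/ArrayType fields with
        # their original names, so no manual parsing is needed.
        df = (spark.read.format("bigquery")   # `spark` is the notebook's SparkSession
              .option("table", "my-project.my_dataset.my_table")
              .load())

        df.printSchema()  # nested column and sub-column names are preserved

        # Write to the sink as Parquet, which keeps the nested structure intact.
        (df.write
           .mode("overwrite")
           .parquet("abfss://container@storageaccount.dfs.core.windows.net/raw/my_table"))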