I have a BigQuery table with a REPEATED RECORD field whose records contain just two keys, KEY and VALUE. The data looks like this:
[{"KEY":"TESTING_FLAG", "VALUE": "TRUE"},{"KEY":"THRESHOLD", "VALUE": "10"}]
In BigQuery the schema looks like this:
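(Field listing reconstructed from the sample above; the field in question is DS_GUARDRAILS:)

DS_GUARDRAILS    RECORD    REPEATED
    KEY          STRING    NULLABLE
    VALUE        STRING    NULLABLE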
I wrote an Apache Beam Java job that was supposed to handle the equivalent field by writing it to GCS as Parquet, and I built an external table on top of the Parquet files. However, the external table's schema came out looking like this:
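(Listing reconstructed from the description; the exact wrapper names depend on how the Parquet writer encodes lists, but with the common three-level LIST encoding BigQuery shows something like:)

DS_GUARDRAILS            RECORD    NULLABLE
    list                 RECORD    REPEATED
        element          RECORD    NULLABLE
            KEY          STRING    NULLABLE
            VALUE        STRING    NULLABLE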
I'm concerned this might not faithfully represent the data type we originally had in the native BigQuery table. We passed the Beam job an Avro schema that defined the field as:
{
  "name": "DS_GUARDRAILS",
  "type": [
    "null",
    {
      "type": "array",
      "items": {
        "type": "record",
        "name": "DS_GUARDRAIL",
        "fields": [
          {
            "name": "KEY",
            "type": "string"
          },
          {
            "name": "VALUE",
            "type": "string"
          }
        ]
      }
    }
  ],
  "default": null
}
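The write itself follows the standard ParquetIO pattern. As a rough sketch (simplified from the real job; the class name, transform label, and bucket path below are placeholders, and the PCollection is assumed to carry AvroCoder.of(avroSchema)):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

public class GuardrailsParquetWriter {
    // Writes Avro GenericRecords (built against the schema above) to GCS as Parquet files.
    public static void writeToGcs(PCollection<GenericRecord> records, Schema avroSchema) {
        records.apply(
            "WriteParquetToGcs",
            FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(avroSchema))
                .to("gs://my-bucket/ds_guardrails/")  // placeholder output prefix
                .withSuffix(".parquet"));
    }
}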
Should I have done this differently to mimic the original structure? Or is this just a quirk of how BigQuery represents Parquet data of this type through an external table?
Your latter assumption is correct: this is just the way BigQuery represents the data, and it doesn't mean the data has been altered. The difference you are seeing comes from how Parquet encodes repeated fields: your REPEATED RECORD is stored using Parquet's LIST logical type, which adds wrapper group levels around the array elements, whereas BigQuery's native schema expresses the nesting directly. When BigQuery infers the schema of an external table over those files, it surfaces the wrapper levels.
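Concretely, the same row carries the same KEY/VALUE pairs either way; assuming the usual three-level wrapper names (list and element), the external table's view of your sample row would look roughly like:

{"DS_GUARDRAILS": {"list": [
    {"element": {"KEY": "TESTING_FLAG", "VALUE": "TRUE"}},
    {"element": {"KEY": "THRESHOLD", "VALUE": "10"}}
]}}

If you would rather the external table expose a plain REPEATED RECORD like the native table, BigQuery's list inference option for Parquet (enable_list_inference) tells it to collapse the LIST wrappers back into a repeated field.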
For more detail, see the BigQuery documentation on loading Parquet data (which covers the LIST logical type and list inference) and on querying Cloud Storage external tables.