Tags: google-bigquery, google-cloud-dataflow, apache-beam, dataflow, apache-beam-io

GCP Dataflow job REST response: add a displayData object with { "key":"datasetName", ...}


Why doesn't this line of code generate a displayData object with { "key":"datasetName", ...}, and how can I generate it if it is not produced by default when using the BigQuery source from Apache Beam?

bigqcollection = p | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(project=project,query=get_java_query))

[UPDATE] Adding the result that I am trying to produce:

"displayData": [
                    {
                        "key": "table",
                        "namespace": "....",
                        "strValue": "..."
                    },          
                    {
                        "key": "datasetName",
                        "strValue": "..."
                    }
]

Solution

  • Judging from the implementation of display_data() for a BigQuerySource in the most recent version of Beam, the table and dataset are not extracted when the source is built from a query, as it is in your example. More significantly, it does not create any field specifically named datasetName.

    I would recommend writing a subclass of _BigQuerySource that adds the fields you need to the display data while preserving all the other behavior.
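As a rough illustration of the subclassing idea, here is a minimal sketch written as a mixin so that it runs without Beam installed. The names DatasetDisplayDataMixin, FakeQuerySource, QuerySourceWithDataset and the dataset_name argument are all hypothetical, and FakeQuerySource is only a stand-in for the real source class. It assumes the base class exposes a display_data() method that returns a dict, which is how Beam's display data hook is generally described.

```python
class DatasetDisplayDataMixin:
    """Adds a 'datasetName' entry on top of the base class's display data."""

    # Hypothetical attribute: supplied by the caller, because the dataset
    # is not parsed out of the query text.
    dataset_name = None

    def display_data(self):
        # Copy the parent's entries so all existing fields are preserved.
        data = dict(super().display_data())
        if self.dataset_name is not None:
            data["datasetName"] = self.dataset_name
        return data


class FakeQuerySource:
    """Stand-in for the Beam source, used only to make this sketch runnable."""

    def __init__(self, query):
        self.query = query

    def display_data(self):
        return {"query": self.query}


class QuerySourceWithDataset(DatasetDisplayDataMixin, FakeQuerySource):
    """Source that reports its dataset name alongside the inherited fields."""

    def __init__(self, query, dataset_name):
        super().__init__(query)
        self.dataset_name = dataset_name


source = QuerySourceWithDataset("SELECT 1", dataset_name="my_dataset")
display = source.display_data()
print(display)
```

In a real pipeline you would presumably swap FakeQuerySource for _BigQuerySource (if your Beam version still ships it) and pass the resulting object to beam.io.Read. Whether the Dataflow REST response then shows the entry exactly as { "key": "datasetName", "strValue": ... } is an assumption that should be checked against a submitted job.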