Search code examples
google-bigqueryavro

BigQuery Avro load job with useAvroLogicalTypes incorrect field types


I'm using node.js to:

  • create an Avro file using the avsc NPM package
  • upload to GCS using the @google-cloud/storage NPM package
  • call the BQ API using the @google-cloud/bigquery NPM package to load the Avro from GCS to BQ

My problem is that even if I set useAvroLogicalTypes when I create the load job, my date data is never correctly created as TIMESTAMP in BQ, always INTEGER - which should be the normal behavior when useAvroLogicalTypes is NOT set.

According to the same docs, if timestamp-millis is set in the Avro schema definition, it should load as TIMESTAMP.

My configuration:

Avro schema

{
      "name": "metadata",
      "type": {
        "name": "metadata",
        "type": "record",
        "fields": [
          {
            "name": "creationTime",
            "type": "long",
            "logicalType": "timestamp-millis"
          },
          {
            "name": "lastActivity",
            "type": ["null", "long"],
            "logicalType": "timestamp-millis"
          },
          {
            "name": "deletionTime",
            "type": ["null", "long"],
            "logicalType": "timestamp-millis"
          },
          {
            "name": "lastSignInTime",
            "type": ["null", "long"],
            "logicalType": "timestamp-millis"
          }
        ]
      }
    }

Create load job

const metadata = {
  sourceFormat: 'AVRO',
  useAvroLogicalTypes: true
}

The only way I could get it work is by specifying the scema again in the metadata var for the load job. My question is, isn't the whole point of specifying the Avro schema and turning on useAvroLogicalTypes to avoid having to explicitly spec the schema again the the load job metadata?

Working setup

const metadata = {
    sourceFormat: 'AVRO',
    useAvroLogicalTypes: true,
    schema: {
      fields: [
        ...otherFields,
        {
          name: 'metadata',
          type: 'STRUCT',
          mode: 'REQUIRED',
          fields: [
            { name: 'creationTime', type: 'TIMESTAMP', mode: 'REQUIRED' },
            { name: 'lastActivity', type: 'TIMESTAMP', mode: 'NULLABLE' },
            { name: 'deletionTime', type: 'TIMESTAMP', mode: 'NULLABLE' },
            { name: 'lastSignInTime', type: 'TIMESTAMP', mode: 'NULLABLE' }
          ]
        }
      ]
    }
  }

Solution

  • Answering your question about specifying the Avro schema during creating the load job. It is possible to omit schema property, which is the schema for the destination table:

    The schema can be omitted if the destination table already exists, or if you're loading data from Google Cloud Datastore.

    If you want to read more about it, please refer to the documentation.

    In addition, you can also take a look to the --autodetect flag which replaces the schema definition.

    I hope it will help you.