Tags: json, avro, google-cloud-data-fusion

Transform an Avro file into JSON with Wrangler in Cloud Data Fusion


I am trying to read an Avro file, apply a basic transformation with Wrangler (remove records where name = Ben), and write the result as a JSON file to Google Cloud Storage. The Avro file has the following schema:

{ "type": "record", "name": "etlSchemaBody", "fields": [ { "type": "string", "name": "name" } ] }

The transformation in Wrangler is the following: [screenshot: transformation]

The following is the output schema for the JSON file: [screenshot: output schema]

When I run the pipeline, it completes successfully and the JSON file is created in Cloud Storage, but the file is empty. When I try a preview run, I get the following message: [screenshot: warning message]

Why is the JSON output file in Cloud Storage empty?


Solution

  • When using Wrangler to make transformations, the default values for the GCS source are format: text and body: string (data type). However, to properly work with an Avro file in Wrangler you need to change this: set the format to blob and the body data type to bytes, so that Wrangler receives the raw binary content of the file, as follows:

    [screenshot: format]

    [screenshot: body data type]
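
    In other words, the GCS source's output schema should contain a single field named body with the bytes data type. A minimal sketch, reusing the record name etlSchemaBody from the question:

      {
        "type": "record",
        "name": "etlSchemaBody",
        "fields": [
          { "name": "body", "type": "bytes" }
        ]
      }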

    After that, the preview for your pipeline should produce output records. You can see my working example next: [screenshot: working example]

    Edit:

    You need to set format: blob and the output schema to body: bytes if you want to parse the file as Avro within Wrangler, as described above, because the Avro parser needs the content of the file in binary format.
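
    A minimal sketch of such a Wrangler recipe, assuming the standard parse-as-avro-file and filter-rows-on directives (the column prefix and quoting may vary by Wrangler version):

      parse-as-avro-file :body
      filter-rows-on condition-true name == 'Ben'

    The first directive parses the binary body column as an Avro data file; the second removes the rows where the condition evaluates to true. Depending on your version, you may also need a drop directive afterwards so the remaining columns match your output schema.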

    On the other hand, if you only want to apply filters (within Wrangler), you could do the following:

    • Open the file using format: avro. [screenshot]
    • Set the output schema according to the fields in your Avro file, in this case name with the string data type. [screenshot]
    • Use only filters in Wrangler (no parsing to Avro here); a sketch of such a recipe follows below. [screenshot]

    And this way you can also get the desired result.
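
    As a minimal sketch, the filter-only recipe reduces to a single directive, since the GCS source with format: avro already delivers name as a string column (again assuming the filter-rows-on directive; quoting may vary):

      filter-rows-on condition-true name == 'Ben'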