I am trying to read an Avro file, make a basic transformation (remove records with name = Ben) using Wrangler, and write the result as a JSON file to Google Cloud Storage. The Avro file has the following schema:
{ "type": "record", "name": "etlSchemaBody", "fields": [ { "type": "string", "name": "name" } ] }
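The intended transformation can be sketched in plain Python (the sample names besides "Ben" are made up for illustration): given records matching the etlSchemaBody schema, drop the ones where name is "Ben".

```python
# Records shaped like the etlSchemaBody schema: one string field, "name".
records = [{"name": "Alice"}, {"name": "Ben"}, {"name": "Carol"}]

# The transformation the pipeline should perform: remove records with name = Ben.
filtered = [r for r in records if r["name"] != "Ben"]
print(filtered)  # [{'name': 'Alice'}, {'name': 'Carol'}]
```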
The transformation in Wrangler is the following: [screenshot: transformation]
The following is the output schema for the JSON file: [screenshot: output schema]
When I run the pipeline, it completes successfully and the JSON file is created in Cloud Storage, but the file is empty. When I try a preview run, I get the following: [screenshot: warning message]
Why is the JSON output file in Cloud Storage empty?
When using Wrangler to make transformations, the defaults for the GCS source are format: text with a body column of type string. However, to properly work with an Avro file in Wrangler you need to change that: set the format to blob and the body data type to bytes, as follows:
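With the source configured as blob/bytes, the Wrangler recipe can then parse the binary column itself. A sketch of such a recipe (the parse directive is taken from the CDAP Wrangler directive set; the filter line is illustrative, and its exact syntax depends on your Wrangler version):

```
parse-as-avro-file body
filter-rows-on condition-true name == "Ben"
```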
After that, the preview for your pipeline should produce output records. You can see my working example next:
Edit:
You need to set format: blob and the output schema to body: bytes if you want to parse the file as Avro within Wrangler, as described above, because Wrangler needs the content of the file in binary format.
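The reason the binary route is needed: Avro object container files are binary, beginning with the magic bytes "Obj" plus a version byte 0x01 (per the Avro specification), and the rest of the file is not valid text. A minimal illustration:

```python
# Avro object container files start with the 4-byte magic "Obj" + 0x01
# (per the Avro 1.x specification); the remainder is binary header
# metadata and data blocks, not valid UTF-8 text.
avro_magic = b"Obj\x01"

# Reading such a file as text (format: text / body: string) would try to
# decode arbitrary binary as text; keeping it as bytes (format: blob /
# body: bytes) preserves the content for Avro parsing in Wrangler.
print(avro_magic[:3].decode("ascii"))  # Obj
print(avro_magic[3])                   # 1
```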
On the other hand, if you only want to apply filters (within Wrangler), you could do the following: set format: avro (see image) and declare name with string data type in the output schema (see image). This way you can also get the desired result.