Encode GenericData.Record field separately as encoded key


I am trying to use Avro to encode key / value pairs, but can't figure out how to encode just a single field in a schema / GenericData.Record in order to make the key.

Take this simple schema:

{"name":"TestRecord", "type":"record", "fields":[
  {"name":"id", "type":"long"},
  {"name":"name", "type":"string"},
  {"name":"desc", "default":null, "type":["null","string"]}
]}

I am encoding records like this:

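import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

val schemaParser = new Schema.Parser()
val testRecordSchema = schemaParser.parse(testRecordSchemaString)
val writer = new GenericDatumWriter[GenericRecord](testRecordSchema)
val baos = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(baos, null)
// Build the record against the parsed schema
val record = new GenericData.Record(testRecordSchema)
record.put("id", 1L)
record.put("name", "test")
writer.write(record, encoder)
encoder.flush() // the encoded bytes are now in baos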

Now suppose I want to encode just the id field separately, to use as the key. I want to select the field by name, because sometimes I want to use the name field as the key instead of id.

I tried multiple permutations of GenericDatumWriter. GenericDatumWriter has a method called writeField that looks promising, but it is protected. Otherwise it looks like you have to write complete records.

I could wrap my field in a new record type defined in a new schema, for example:

{"name":"TestRecordKey", "type":"record", "fields":[
  {"name":"id", "type":"long"}
]}

I'm 100% sure I can make that work, but then I have to create and manage a new record type for every key field. That's not minor, and it really seems like there should be a simpler way to do this.


Solution

  • As it turns out, it wasn't that difficult just to create a new record-type schema with only one field -- the field I want to use as the key, like the example I have above:

    {"name":"TestRecordKey", "type":"record", "fields":[
      {"name":"id", "type":"long"}
    ]}
    

    I do it on the fly, as I initialize my Schema.Parser with the payload schemas -- I just create the key schema based on the payload schema programmatically.

    I was hoping for a less long-winded solution, but this works. I'll still upvote and accept any cleaner solution.
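
For reference, the on-the-fly key schema described above could be sketched roughly as follows. This is a minimal sketch under assumptions, not the answerer's actual code: the `KeyEncoding` object and the `keySchemaFor` and `encodeKey` helpers are hypothetical names, the key record is assumed to be named after the payload record with a `Key` suffix (as in `TestRecordKey`), and Avro's `SchemaBuilder` is assumed to be available in the Avro version in use.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

// Hypothetical helper object; the names are illustrative only.
object KeyEncoding {

  // Derive a one-field record schema ("<PayloadName>Key") from the payload schema.
  def keySchemaFor(payload: Schema, fieldName: String): Schema = {
    val field = payload.getField(fieldName)
    require(field != null, s"no field '$fieldName' in ${payload.getName}")
    SchemaBuilder
      .record(payload.getName + "Key")
      .fields()
      .name(fieldName).`type`(field.schema()).noDefault() // `type` is a Scala keyword
      .endRecord()
  }

  // Copy the chosen field into a key record and binary-encode it.
  def encodeKey(record: GenericRecord, fieldName: String): Array[Byte] = {
    val keySchema = keySchemaFor(record.getSchema, fieldName)
    val key = new GenericData.Record(keySchema)
    key.put(fieldName, record.get(fieldName))

    val baos = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(baos, null)
    new GenericDatumWriter[GenericRecord](keySchema).write(key, encoder)
    encoder.flush()
    baos.toByteArray
  }
}
```

With the record from the question, `KeyEncoding.encodeKey(record, "id")` or `KeyEncoding.encodeKey(record, "name")` selects the key field by name. The sketch rebuilds the key schema on every call for brevity; in practice it would probably be worth building each key schema once, when the payload schemas are parsed, and reusing it. Note that Avro's binary encoding of a record is just the concatenation of its encoded fields, so the key bytes produced here are the same as the encoding of the bare field value.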