avro, spark-avro

Writing an array of multiple different Records to Avro format, into the same file


We have a legacy file format that I need to migrate to Avro storage. The tricky part is that the records have

  • some common fields,
  • a discriminator field and
  • some unique fields, specific to the type selected by the discriminator field

all stored in the same file, in no particular order, fully mixed with each other. (It's legacy...)

In Java/object-oriented terms, one could represent our record structure as follows:

abstract class RecordWithCommonFields {
   private Long commonField1;
   private String commonField2;
   ...
}

class RecordTypeA extends RecordWithCommonFields {
   private Integer specificToA1;
   private String specificToA2;
   ...
}

class RecordTypeB extends RecordWithCommonFields {
   private Boolean specificToB1;
   private String specificToB2;
   ...
}

Imagine the data being something like this:

commonField1Value;commonField2Value,TYPE_IS_A,specificToA1Value,specificToA2Value
commonField1Value;commonField2Value,TYPE_IS_B,specificToB1Value,specificToB2Value
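
To make the layout concrete, here is a minimal sketch of how such a line could be split and dispatched on the discriminator. It uses only the Java standard library. The class name `LegacyLineSplitter`, the sample values, and the assumptions that both `;` and `,` act as separators and that the discriminator is always the third field are illustrative, not part of the real format.

```java
import java.util.Arrays;
import java.util.List;

final class LegacyLineSplitter {

    // Splits one line into its fields. Both ';' and ',' are accepted as
    // separators (assumption); the -1 limit keeps trailing empty fields.
    static List<String> splitFields(String line) {
        return Arrays.asList(line.split("[;,]", -1));
    }

    // Assumption: two common fields come first, so the discriminator is field 3.
    static String discriminatorOf(List<String> fields) {
        return fields.get(2);
    }

    // Everything after the discriminator belongs to the selected type.
    static List<String> specificFields(List<String> fields) {
        return fields.subList(3, fields.size());
    }

    public static void main(String[] args) {
        String[] lines = {
            "1;foo,TYPE_IS_A,42,bar",    // made-up sample values
            "2;baz,TYPE_IS_B,true,qux"
        };
        for (String line : lines) {
            List<String> fields = splitFields(line);
            String type = discriminatorOf(fields);
            switch (type) {
                case "TYPE_IS_A":
                case "TYPE_IS_B":
                    System.out.println(type + " -> " + specificFields(fields));
                    break;
                default:
                    // Fail loudly rather than silently dropping unknown record types.
                    throw new IllegalArgumentException("Unknown record type: " + type);
            }
        }
    }
}
```

The point of the sketch is only that the record type is not known until the discriminator has been read, so whatever the target schema looks like, it has to allow either set of specific fields in any position of the file.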

So I would like to process an incoming file and write its content to Avro format, somehow representing the different types of the records.

Can someone give me some ideas on how to achieve this?


Solution

  • Nandor from the Avro users mailing list was kind enough to help me out with this answer, so credit goes to him. I'm posting it here for the record, in case someone else hits the same issue.

    His solution is simple: use composition rather than inheritance. A common container record holds the shared fields, plus one field whose type is a union of the type-specific records.

    With this approach the schema looks like this:

    {
      "namespace": "com.foobar",
      "name": "UnionRecords",
      "type": "array",
      "items": {
        "type": "record",
        "name": "RecordWithCommonFields",
        "fields": [
          {"name": "commonField1", "type": "string"},
          {"name": "commonField2", "type": "string"},
          {"name": "subtype", "type": [
            {
              "type" : "record",
              "name": "RecordTypeA",
              "fields" : [
                {"name": "integerSpecificToA1", "type": ["null", "long"] },
                {"name": "stringSpecificToA1", "type": ["null", "string"]}
              ]
            },
            {
              "type" : "record",
              "name": "RecordTypeB",
              "fields" : [
                {"name": "booleanSpecificToB1", "type": ["null", "boolean"]},
                {"name": "stringSpecificToB1", "type": ["null", "string"]}
              ]
            }
          ]}
        ]
      }
    }
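
For readers who think in Java rather than in schema JSON, the composition this schema describes might look roughly like the sketch below. It deliberately uses no Avro API, only plain Java classes that mirror the record and field names from the schema. The wrapper class `CompositionDemo`, the `subtypeName()` helper, and the sample values are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class CompositionDemo {

    // Mirrors RecordTypeA in the schema; both fields are nullable there.
    static final class RecordTypeA {
        final Long integerSpecificToA1;
        final String stringSpecificToA1;

        RecordTypeA(Long integerSpecificToA1, String stringSpecificToA1) {
            this.integerSpecificToA1 = integerSpecificToA1;
            this.stringSpecificToA1 = stringSpecificToA1;
        }
    }

    // Mirrors RecordTypeB in the schema; both fields are nullable there.
    static final class RecordTypeB {
        final Boolean booleanSpecificToB1;
        final String stringSpecificToB1;

        RecordTypeB(Boolean booleanSpecificToB1, String stringSpecificToB1) {
            this.booleanSpecificToB1 = booleanSpecificToB1;
            this.stringSpecificToB1 = stringSpecificToB1;
        }
    }

    // The container: common fields plus exactly one type-specific part.
    static final class RecordWithCommonFields {
        final String commonField1;
        final String commonField2;
        final Object subtype; // plays the role of the union field in the schema

        RecordWithCommonFields(String commonField1, String commonField2, Object subtype) {
            // Only the two branches of the union are allowed (null is rejected too).
            if (!(subtype instanceof RecordTypeA) && !(subtype instanceof RecordTypeB)) {
                throw new IllegalArgumentException("subtype must be RecordTypeA or RecordTypeB");
            }
            this.commonField1 = commonField1;
            this.commonField2 = commonField2;
            this.subtype = subtype;
        }

        String subtypeName() {
            return subtype.getClass().getSimpleName();
        }
    }

    // One list can hold both kinds of record, mixed in any order.
    static List<RecordWithCommonFields> sample() {
        List<RecordWithCommonFields> records = new ArrayList<>();
        records.add(new RecordWithCommonFields("1", "foo", new RecordTypeA(42L, "bar")));
        records.add(new RecordWithCommonFields("2", "baz", new RecordTypeB(true, "qux")));
        return records;
    }

    public static void main(String[] args) {
        for (RecordWithCommonFields r : sample()) {
            System.out.println(r.commonField1 + " holds a " + r.subtypeName());
        }
    }
}
```

Because every element has the same outer type, the mixed ordering of the legacy file stops being a problem: the discriminator only decides which branch of the `subtype` field gets filled in.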