Avro Schema Aliases


I am using two different schemas to write data using Avro. The first one is

{
  "type" : "record",
  "name" : "DynamicFact_aSource",
  "fields" : [ {
    "name" : "user",
    "type" : {
      "type" : "record",
      "name" : "User",
      "namespace" : "foo",
      "fields" : [...]
    }
  }, {
    "name" : "event",
    "type" : {
      "type" : "record",
      "name" : "Event",
      "namespace" : "foo",
      "fields" : [...]
    }
  }, {
    "name" : "aLabel",
    "type" : [ "null", {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "Label",
        "namespace" : "foo",
        "doc" : "* Holds a single string.",
        "fields" : [ {
          "name" : "value",
          "type" : {
            "type" : "string",
            "avro.java.string" : "String"
          }
        } ]
      }
    } ],
    "default" : null
  } ]
}

and the second one is

{
  "type" : "record",
  "name" : "DynamicFact_bSource",
  "fields" : [ {
    "name" : "user",
    "type" : {
      "type" : "record",
      "name" : "User",
      "namespace" : "foo",
      "fields" : [...]
    }
  }, {
    "name" : "event",
    "type" : {
      "type" : "record",
      "name" : "Event",
      "namespace" : "foo",
      "fields" : [...]
    }
  }, {
    "name" : "anotherLabel",
    "type" : [ "null", {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "Label",
        "namespace" : "foo",
        "doc" : "* Holds a single string.",
        "fields" : [ {
          "name" : "value",
          "type" : {
            "type" : "string",
            "avro.java.string" : "String"
          }
        } ]
      }
    } ],
    "default" : null
  } ]
}

Now I want to read data written with these two schemas into a single RDD. To do that, I am creating a new reader schema like so

{
    "type": "record",
    "name": "DynamicFact_multiSourceAggregation",
    "fields":
    [
        {
            "name": "user",
            "type":
            {
                "type": "record",
                "name": "User",
                "namespace": "foo",
                "fields":
                []
            }
        },
        {
            "name": "event",
            "type":
            {
                "type": "record",
                "name": "Event",
                "namespace": "foo",
                "fields":
                []
            }
        },
        {
            "name": "theMultiSourceAttributeLabel",
            "type":
            [
                "null",
                {
                    "type": "array",
                    "items":
                    {
                        "type": "record",
                        "name": "Label",
                        "namespace": "foo",
                        "doc": "* Holds a single string.",
                        "fields":
                        [
                            {
                                "name": "value",
                                "type":
                                {
                                    "type": "string",
                                    "avro.java.string": "String"
                                }
                            }
                        ]
                    }
                }
            ],
            "default": null,
            "aliases":
            [
                "aLabel",
                "anotherLabel"
            ]
        }
    ]
}

The reader schema defines user and event exactly as the writer schemas do. For the third field, however, instead of aLabel and anotherLabel, which the two writer schemas use, I am using a different name, theMultiSourceAttributeLabel, and setting aliases for it

"aliases":
[
    "aLabel",
    "anotherLabel"
]

so that the reader can map writer fields named aLabel or anotherLabel to theMultiSourceAttributeLabel. Although user and event are loaded successfully with the reader schema, the aLabel and anotherLabel fields are not: every theMultiSourceAttributeLabel value comes back null.

Should Avro be able to figure out this mapping once it is given the list of aliases? Why is every theMultiSourceAttributeLabel value in the resulting RDD null?


Solution

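  • The problem here is that the writer schemas have different record names. In Avro's Java implementation, a reader's field aliases are applied per record, after the record names have been matched, so the reader schema's record name must be aliased as well, in addition to the field name aliasing. When defining the reader schema, one must therefore do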

        def schema(): Schema = {
          val builder = SchemaBuilder
            .record(s"DynamicFact_$name")
            .aliases(getSourceSchemaAliases: _*)
            .fields()
          ...
        }
    

    This will produce

    {
        "type": "record",
        "name": "DynamicFact_multiSourceAggregation",
        "fields":[...],
        "aliases":
        [
            "DynamicFact_aSource",
            "DynamicFact_bSource"
        ]
    }
    

    which is enough for the reader schema to read data written with either writer schema.
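
To check this behaviour outside Spark, a small round trip with Avro's generic API can help. The following is only a sketch under assumptions: it needs the Avro Java library on the classpath, it replaces the nested User, Event and Label records with a single nullable string field to stay short, and the object and helper names (AliasRoundTrip, writerSchema, readerSchema, roundTrip) are hypothetical, not part of the original code.

```scala
import java.io.ByteArrayOutputStream

import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Hypothetical helper object, for illustration only.
object AliasRoundTrip {

  // Simplified writer schema: one nullable string field stands in for the
  // nested records of the question.
  def writerSchema(recordName: String, fieldName: String): Schema =
    SchemaBuilder.record(recordName).fields()
      .optionalString(fieldName)
      .endRecord()

  // Reader schema carrying the field aliases plus the given record-name aliases.
  def readerSchema(recordAliases: Seq[String]): Schema =
    SchemaBuilder.record("DynamicFact_multiSourceAggregation")
      .aliases(recordAliases: _*)
      .fields()
      .name("theMultiSourceAttributeLabel").aliases("aLabel", "anotherLabel")
      .`type`().optional().stringType()
      .endRecord()

  // Writes one record with the writer schema, then reads it back through the
  // reader schema and returns the value of the aliased field.
  def roundTrip(writer: Schema, fieldName: String, reader: Schema): AnyRef = {
    val record = new GenericData.Record(writer)
    record.put(fieldName, "hello")

    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](writer).write(record, encoder)
    encoder.flush()

    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    val decoded = new GenericDatumReader[GenericRecord](writer, reader).read(null, decoder)
    decoded.get("theMultiSourceAttributeLabel")
  }

  def main(args: Array[String]): Unit = {
    val sources = Seq(
      "DynamicFact_aSource" -> "aLabel",
      "DynamicFact_bSource" -> "anotherLabel"
    )
    // Reader schema aliased to both writer record names, as in the answer.
    val reader = readerSchema(sources.map(_._1))
    for ((recordName, fieldName) <- sources) {
      val value = roundTrip(writerSchema(recordName, fieldName), fieldName, reader)
      println(s"$recordName.$fieldName -> $value")
    }
  }
}
```

With both record aliases in place, each round trip should return the written value. Passing an empty list to readerSchema instead is a quick way to try to reproduce the all-null symptom from the question, although the exact behaviour without record aliases may vary between Avro versions.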