Nesting Avro schemas


According to this question on nesting Avro schemas, the right way to nest a record schema is as follows:

{
    "name": "person",
    "type": "record",
    "fields": [
        {"name": "firstname", "type": "string"},
        {"name": "lastname", "type": "string"},
        {
            "name": "address",
            "type": {
                        "type" : "record",
                        "name" : "AddressUSRecord",
                        "fields" : [
                            {"name": "streetaddress", "type": "string"},
                            {"name": "city", "type": "string"}
                        ]
                    }
        }
    ]
}

I don't like giving the field the name address and having to give a different name (AddressUSRecord) to the field's schema. Can I give the field and schema the same name, address?

What if I want to use the AddressUSRecord schema in multiple other schemas, not just person? If I want to use AddressUSRecord in another schema, let's say business, do I have to name it something else?

Ideally, I'd like to define AddressUSRecord in a separate schema, then let the type of address reference AddressUSRecord. However, it's not clear that Avro 1.8.1 supports this out of the box. This 2014 article shows that sub-schemas need to be handled with custom code. What is the best way to define reusable schemas in Avro 1.8.1?

Note: I'd like a solution that works with Confluent Inc.'s Schema Registry. There's a Google Groups thread that seems to suggest that Schema Registry does not play nice with schema references.


Solution

  • Can I give the field and schema the same name, address?

    Yes, you can name the record with the same name as the field name.
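To make that concrete, here is a small sketch of the person schema from the question with both the field and its record type called address. The field name lives inside the person record, while the record name is a type name, so the two do not collide. The Python wrapper only checks that the text is well-formed JSON; validating it as Avro would need an Avro library, which is not assumed here.

```python
import json

# The field and the nested record type are both named "address".
PERSON_SCHEMA = """
{
  "type": "record",
  "name": "person",
  "fields": [
    {"name": "firstname", "type": "string"},
    {"name": "lastname", "type": "string"},
    {
      "name": "address",
      "type": {
        "type": "record",
        "name": "address",
        "fields": [
          {"name": "streetaddress", "type": "string"},
          {"name": "city", "type": "string"}
        ]
      }
    }
  ]
}
"""

# json.loads checks JSON syntax only, not Avro validity.
schema = json.loads(PERSON_SCHEMA)
address_field = schema["fields"][2]
print(address_field["name"], address_field["type"]["name"])
```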

    What if I want to use the AddressUSRecord schema in multiple other schemas, not just person?

    There are a couple of ways to reuse one schema in several others. Avro schema parsers (on the JVM and elsewhere) generally let you supply more than one schema, usually through a names parameter. In Java, a single Schema.Parser instance remembers the named types of every schema it has already parsed, so later schemas can refer to them by name.

    You can then define the dependent schema as a named type:

    {
      "type": "record",
      "name": "Address",
      "fields": [
        {
          "name": "streetaddress",
          "type": "string"
        },
        {
          "name": "city",
          "type": "string"
        }
      ]
    }
    

    And run this through the parser before the parent schema:

    {
      "name": "person",
      "type": "record",
      "fields": [
        {
          "name": "firstname",
          "type": "string"
        },
        {
          "name": "lastname",
          "type": "string"
        },
        {
          "name": "address",
          "type": "Address"
        }
      ]
    }
    

    Incidentally, this allows you to parse from separate files.
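If the tool consuming the schema accepts only one self-contained document, the separately defined schemas can also be merged before parsing. The following is a minimal sketch of that idea using only the standard library. The helper inline_named_types is hypothetical and not part of any Avro library, and it assumes the schemas use no namespaces. It replaces the first reference to each known type with its full definition and leaves later references as plain names, since a named type may be defined only once per schema.

```python
import copy
import json

ADDRESS = {
    "type": "record",
    "name": "Address",
    "fields": [
        {"name": "streetaddress", "type": "string"},
        {"name": "city", "type": "string"},
    ],
}

PERSON = {
    "type": "record",
    "name": "person",
    "fields": [
        {"name": "firstname", "type": "string"},
        {"name": "lastname", "type": "string"},
        {"name": "address", "type": "Address"},
    ],
}


def inline_named_types(schema, definitions, seen=None):
    """Hypothetical helper: expand the first reference to each known named type.

    `definitions` maps a type name to its schema dict. The input is not modified.
    """
    if seen is None:
        seen = set()
    if isinstance(schema, str):
        # A bare string is either a primitive or a reference to a named type.
        if schema in definitions and schema not in seen:
            seen.add(schema)
            definition = copy.deepcopy(definitions[schema])
            return inline_named_types(definition, definitions, seen)
        return schema
    if isinstance(schema, list):
        # A JSON array is a union; resolve each branch in order.
        return [inline_named_types(branch, definitions, seen) for branch in schema]
    result = dict(schema)
    kind = result.get("type")
    if kind in ("record", "enum", "fixed") and "name" in result:
        seen.add(result["name"])
    if kind == "record":
        result["fields"] = [
            {**field, "type": inline_named_types(field["type"], definitions, seen)}
            for field in result["fields"]
        ]
    elif kind == "array":
        result["items"] = inline_named_types(result["items"], definitions, seen)
    elif kind == "map":
        result["values"] = inline_named_types(result["values"], definitions, seen)
    return result


merged = inline_named_types(PERSON, {"Address": ADDRESS})
print(json.dumps(merged, indent=2))
```

The same definitions dictionary can be reused for a business schema or any other parent, so Address is still written in exactly one place.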

    Alternatively, you can parse a single union schema that defines the shared record first and then references it by name in the same way:

    [
      {
        "type": "record",
        "name": "Address",
        "fields": [
          {
            "name": "streetaddress",
            "type": "string"
          },
          {
            "name": "city",
            "type": "string"
          }
        ]
      },
      {
        "type": "record",
        "name": "person",
        "fields": [
          {
            "name": "firstname",
            "type": "string"
          },
          {
            "name": "lastname",
            "type": "string"
          },
          {
            "name": "address",
            "type": "Address"
          }
        ]
      }
    ]
    

    I'd like a solution that works with Confluent Inc.'s Schema Registry.

    Schema Registry does not support parsing schemas separately, but it does support the latter approach of combining them into a single union type.
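When the schemas are maintained separately, the union document can be generated instead of written by hand. This is a rough sketch, assuming each schema is already loaded as a Python dict and that dependencies are passed before the schemas that reference them. build_union is a hypothetical helper name, and submitting the resulting text to the registry is left out.

```python
import json

ADDRESS = {
    "type": "record",
    "name": "Address",
    "fields": [
        {"name": "streetaddress", "type": "string"},
        {"name": "city", "type": "string"},
    ],
}

PERSON = {
    "type": "record",
    "name": "person",
    "fields": [
        {"name": "firstname", "type": "string"},
        {"name": "lastname", "type": "string"},
        {"name": "address", "type": "Address"},
    ],
}


def build_union(*schemas):
    """Hypothetical helper: return JSON text for a union of the given schemas.

    The order is preserved, so pass dependencies first (Address before person).
    """
    names = [schema["name"] for schema in schemas]
    if len(set(names)) != len(names):
        raise ValueError("each named type may appear only once in the union")
    return json.dumps(list(schemas), indent=2)


union_text = build_union(ADDRESS, PERSON)
print(union_text)
```

Keeping the order explicit matters because a name has to be defined before it is used, which is why Address comes first in the union.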