avro

What is the recommended Avro type namespace/name naming scheme with respect to schema evolution?


What is the recommended naming scheme for Avro types so that schema evolution works with backward and forward compatibility and with schema imports? How do you name your types? How many Schema.Parser instances do you use: one per schema, one global, or some other scheme?


Solution

  • Technically you have two options, each with its own benefits and drawbacks:

    A) include a version identifier in the namespace or type name

    B) do NOT include a version identifier in the namespace or type name

    Explanation: Schema evolution does not require a version number in the name. Both Confluent Schema Registry and Avro's single-object encoding work with namespaces, and both identify the exact writer schema by an ID or fingerprint rather than by a version in its name (single-object encoding uses the CRC-64-AVRO fingerprint of the schema). When deserializing bytes you have to know the writer schema, which is then resolved against the reader schema. The two need not have the same namespace: the spec matches named types by their unqualified name, and aliases on the reader schema can bridge a renamed type. How strictly names are checked depends on the implementation and version, so verify it with the Avro version you use. (https://avro.apache.org/docs/current/spec.html#Schema+Resolution) On the other hand, a single Schema.Parser cannot parse more than one schema with the same full name, that is, namespace.name. Both options are usable; which one fits depends on your use case.

    ad A) If you include a version identifier, you can parse both (or all) versions with the same Schema.Parser, which means, for example, that these schemas can be processed together by avro-maven-plugin (I do not remember whether I tested this with a single plugin configuration only or with multiple configurations as well, so check it yourself). Another benefit is that you can reference the same type in different versions if needed. The drawback is that the namespace and/or type name changes with every version upgrade, so you have to change the imports in your project. Schema resolution between the writer and reader schema should still work, but verify it for your setup.
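
    As an illustration of variant A), here is a minimal sketch of two versions of one record that differ only in a versioned namespace. The namespaces com.example.v1 and com.example.v2, the record name Customer, and the fields are hypothetical and not taken from any real project.

    ```json
    {
      "type": "record",
      "namespace": "com.example.v1",
      "name": "Customer",
      "doc": "Hypothetical v1 schema; all names are illustrative only.",
      "fields": [
        {"name": "id", "type": "long"}
      ]
    }
    ```

    ```json
    {
      "type": "record",
      "namespace": "com.example.v2",
      "name": "Customer",
      "doc": "Hypothetical v2 schema; the added field has a default so that v1 data can still be read.",
      "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": ["null", "string"], "default": null}
      ]
    }
    ```

    Because the full names com.example.v1.Customer and com.example.v2.Customer differ, one Schema.Parser can hold both. The default on the new field is what lets a v2 reader fill it in when the writer schema was v1.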

    ad B) If you do not include a version identifier, only one version can be compiled into Java files by avro-maven-plugin, and you cannot have one global Schema.Parser instance in the project. Why would you want a single global instance? It is helpful if you do not follow the bad but frequent advice to define multiple types in one avsc file with a top-level union. Maybe that is needed with Confluent Schema Registry, but if you do not use it, you definitely do not have to use a top-level union. Instead you can use schema imports, where Schema.Parser processes all imported types first and the actual type last. With imports you need one Schema.Parser instance for each group consisting of a type and its imports. It takes a bit more declaration work, but it frees you from the top-level union, which has issues of its own and is wrong in principle. If your project does not need multiple versions of the same schema accessible at the same time, this is probably better than variant A), because you do not have to change imports. Imports also open up the possibility of composing schemas: since all versions share the same namespace, you can pass an arbitrary version to Schema.Parser. So if type a references type b, you can combine v2 of b with v3 of a. I am not sure that is a typical use case, but it is possible.
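
    A minimal sketch of variant B) with imports: two files in one unversioned namespace, where one type refers to the other by its full name instead of both being wrapped in a top-level union. The file names Address.avsc and Customer.avsc, the namespace com.example, and the fields are hypothetical.

    Address.avsc, the imported type, which has to be parsed first:

    ```json
    {
      "type": "record",
      "namespace": "com.example",
      "name": "Address",
      "doc": "Hypothetical imported type; names are illustrative only.",
      "fields": [
        {"name": "city", "type": "string"}
      ]
    }
    ```

    Customer.avsc, the actual type, which is parsed last:

    ```json
    {
      "type": "record",
      "namespace": "com.example",
      "name": "Customer",
      "doc": "Hypothetical type that references com.example.Address by name.",
      "fields": [
        {"name": "id", "type": "long"},
        {"name": "address", "type": "com.example.Address"}
      ]
    }
    ```

    Customer.avsc cannot be parsed on its own, because com.example.Address is only a reference there; the same Schema.Parser instance has to see Address.avsc first. Any version of Address.avsc can be supplied as long as it keeps the same full name, which is the composition of versions described above.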