apache-kafka, avro, confluent-platform, confluent-schema-registry

How to share Avro schema definitions across teams


Kafka's Schema Registry provides a nice way to serialize and deserialize data from Kafka using a common data contract. However, that data contract (the .avsc file) is the glue between the producer and its consumer(s).

Once the producer creates the .avsc file, it can be checked into version control on the producer's side. Depending on the language, classes can also be auto-generated from it.

However,

  1. What would be the best mechanism for the consumer to pull down the schema definition for reference? Is there anything like SwaggerHub or the typical API documentation portals for Avro?
  2. If we use the Confluent Platform, Control Center provides a GUI to view the schema associated with a topic, but it also allows the user to edit it. How would that work between the producer and consumer teams? What would prevent a consumer, or anyone else, from editing the schema directly on the Confluent Platform?
  3. Is this something we need to custom-build using the REST Proxy?

Solution

  • You're talking about two different ways to work with Avro schemas:

    • Having schema registry store the schemas for you.
    • Generating an .avsc file and making that available to downstream consumers.

    In the first method, your producer has an .avsc file that it uses to serialize messages before sending them to Kafka. If you're using schema registry, you don't need to worry about consumers needing the actual Avro definition, since the whole Avro schema is available from schema registry using the schema id embedded in each message. You don't have the actual generated classes, true, but you can still "walk" the entire message and extract your data from it, as in the sketch below.
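
    For instance, a consumer configured with Confluent's KafkaAvroDeserializer fetches each message's schema from the registry transparently and hands you a GenericRecord to walk. A minimal sketch, assuming hypothetical broker and registry addresses and the topic name from the curl example further down:

      import java.time.Duration;
      import java.util.Collections;
      import java.util.Properties;

      import org.apache.avro.generic.GenericRecord;
      import org.apache.kafka.clients.consumer.ConsumerRecord;
      import org.apache.kafka.clients.consumer.KafkaConsumer;

      public class RegistryConsumer {
          public static void main(String[] args) {
              Properties props = new Properties();
              props.put("bootstrap.servers", "kafka.company.com:9092");  // assumed broker address
              props.put("group.id", "schema-demo");                      // assumed consumer group
              props.put("key.deserializer",
                        "org.apache.kafka.common.serialization.StringDeserializer");
              props.put("value.deserializer",
                        "io.confluent.kafka.serializers.KafkaAvroDeserializer");
              // The deserializer resolves each message's embedded schema id against this registry:
              props.put("schema.registry.url", "http://schema-registry.company.com:8081");

              try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
                  consumer.subscribe(Collections.singletonList("your_topic"));
                  for (ConsumerRecord<String, GenericRecord> record : consumer.poll(Duration.ofSeconds(5))) {
                      GenericRecord value = record.value();
                      // No generated classes needed: walk the record via its schema and field names.
                      System.out.println(value.getSchema().getFullName() + " -> " + value);
                  }
              }
          }
      }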

    In the second method, without using a schema registry, the producer uses an .avsc file to serialize the data sent to Kafka as a byte array, and that file is then made available to consumer/downstream applications, usually through source control. Of course, this means your producer and consumers have to be in sync whenever you make schema changes, or else your consumers won't be able to read the fields the producer has added or modified.
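
    In that setup, the consumer parses the checked-in .avsc and uses it to decode the raw bytes itself. A minimal sketch, assuming the payload is plain binary-encoded Avro (no schema registry framing) and a hypothetical path to the shared file:

      import java.io.File;
      import java.io.IOException;

      import org.apache.avro.Schema;
      import org.apache.avro.generic.GenericDatumReader;
      import org.apache.avro.generic.GenericRecord;
      import org.apache.avro.io.BinaryDecoder;
      import org.apache.avro.io.DatumReader;
      import org.apache.avro.io.DecoderFactory;

      public class SharedSchemaDecoder {
          public static GenericRecord decode(byte[] payload) throws IOException {
              // Hypothetical path to the contract shared through source control
              Schema schema = new Schema.Parser().parse(new File("src/main/avro/your_record.avsc"));
              DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
              BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
              return reader.read(null, decoder);
          }
      }

    If the producer's and consumer's copies of that file drift apart, this decode step is exactly where things break, which is the synchronization problem described above.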

    So, if you're using schema registry, properly configured Kafka consumers will pull the schema that each message requires automatically, and you can then extract the data you need. Separately, you can also get the latest schema for any topic with something like this:

      curl -X GET "http://schema-registry.company.com:8081/subjects/your_topic-value/versions/latest/schema"
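
    The same lookup is also available programmatically through Confluent's Java client; a minimal sketch, assuming the same registry URL and subject name as in the curl call:

      import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
      import io.confluent.kafka.schemaregistry.client.SchemaMetadata;
      import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

      public class LatestSchemaLookup {
          public static void main(String[] args) throws Exception {
              SchemaRegistryClient client = new CachedSchemaRegistryClient(
                      "http://schema-registry.company.com:8081", 100);
              // Value subjects follow the <topic>-value naming convention used above.
              SchemaMetadata latest = client.getLatestSchemaMetadata("your_topic-value");
              System.out.println("version " + latest.getVersion() + ": " + latest.getSchema());
          }
      }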
    

    If, however, you are not using the schema registry, the only way to get the full schema is to have access to the .avsc file used to serialize the message, usually through source control, as mentioned above. You can also then share the auto-generated classes, if available, to deserialize your messages into classes directly.
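
    With generated classes on the classpath, the decode step can go through Avro's SpecificDatumReader and give you typed objects instead of generic records. A minimal sketch, where YourRecord stands in for a hypothetical class generated from the shared .avsc:

      import java.io.IOException;

      import org.apache.avro.io.BinaryDecoder;
      import org.apache.avro.io.DatumReader;
      import org.apache.avro.io.DecoderFactory;
      import org.apache.avro.specific.SpecificDatumReader;

      public class SpecificDecoder {
          public static YourRecord decode(byte[] payload) throws IOException {
              // YourRecord is a hypothetical class produced by the Avro compiler from the .avsc
              DatumReader<YourRecord> reader = new SpecificDatumReader<>(YourRecord.class);
              BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
              return reader.read(null, decoder);  // typed fields, no manual walking
          }
      }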

    For more information on how to interact with Schema Registry, here's a link to the documentation: https://docs.confluent.io/current/schema-registry/schema_registry_tutorial.html#using-curl-to-interact-with-schema-registry

    And some reading on general schema compatibility and how it's handled/configured in Schema Registry: https://docs.confluent.io/current/schema-registry/avro.html