
Schema in Avro message


I see that Avro messages have the schema embedded, followed by the data in binary format. If multiple messages are sent and a new Avro file is created for every message, isn't embedding the schema an overhead? Does that mean the producer should always batch messages before writing, so that multiple messages written into one Avro file carry just one schema? On a different note, is there an option to skip schema embedding when serializing with GenericDatumWriter/SpecificDatumWriter?


Solution

  • You are correct: writing a single record together with its schema carries an overhead. This may seem wasteful, but in some scenarios the ability to reconstruct a record from the data alone, using the embedded schema, matters more than the size of the payload.

    Also take into account that even with the schema included, the data is encoded in a binary format, so it is usually smaller than JSON anyway.

    Finally, frameworks like Kafka can plug into a Schema Registry: rather than storing the schema with each record, they store a pointer to the schema (a schema ID).
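
To make the size trade-off concrete, here is a minimal sketch in plain Python (standard library only, no Avro library involved). It hand-encodes one record using Avro's binary rules (zigzag varint for longs, length-prefixed UTF-8 for strings), then compares the record bytes with the schema JSON and with a registry-style frame. The `User` schema, the sample values, and schema ID 42 are made up for illustration, and the framing assumes the Confluent-style layout (a zero magic byte followed by a 4-byte big-endian schema ID); other registries may frame messages differently.

```python
import json
import struct

# Hypothetical schema, used only for illustration.
SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "long"},
    ],
}


def encode_long(n: int) -> bytes:
    """Avro long: zigzag-encode, then emit 7-bit groups, low group first."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negatives to small positives
    out = bytearray()
    while z & ~0x7F:  # more than 7 bits remain
        out.append((z & 0x7F) | 0x80)  # set the continuation bit
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_user(name: str, age: int) -> bytes:
    """Record fields are written back to back in schema order, with no names or tags."""
    return encode_string(name) + encode_long(age)


def frame_with_schema_id(schema_id: int, body: bytes) -> bytes:
    """Assumed Confluent-style frame: magic byte 0, 4-byte big-endian ID, then the body."""
    return struct.pack(">bI", 0, schema_id) + body


schema_json = json.dumps(SCHEMA, separators=(",", ":")).encode("utf-8")
payload = encode_user("Ada", 36)
framed = frame_with_schema_id(42, payload)  # 42 is a hypothetical schema ID

print("schema JSON bytes:", len(schema_json))
print("record bytes     :", len(payload))
print("framed bytes     :", len(framed))
```

For this record the body is 5 bytes and the framed message is 10 bytes, while the schema JSON alone is many times larger, which is why embedding the schema per record only pays off when it is amortized over a batch. On the last part of the question: in the Java API, as far as I know, the embedded schema comes from `DataFileWriter` (the container file format). A `GenericDatumWriter` or `SpecificDatumWriter` paired with a `BinaryEncoder` from `EncoderFactory` writes only the record bytes, so the reader must obtain the schema some other way.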