hadoop avro parquet

Apache Avro - Internal Representation


I am learning Apache Avro and would like to know how it is represented internally. If I were answering the same question for Apache Parquet, I could say that each Parquet file is composed of row groups, each row group contains column chunks, and each column chunk has multiple pages with different encodings. Finally, the metadata about all of these is stored in the file footer. This file layout is clearly documented on the GitHub page as well as on the official Apache page.

To find the same internal representation for Apache Avro, I looked at several sources, including the GitHub page, the Apache Avro home page, the book Hadoop: The Definitive Guide, and many online tutorials, but I could not find what I am looking for. I understand that Apache Avro is a row-oriented file format and that each file carries its schema along with the data. All of that is fine, but I want to know how the data is further broken down for internal organization, perhaps like pages for RDBMS tables.

Any pointers related to this will be highly appreciated.


Solution

  • The Avro container file format is specified in the Apache Avro specification, in the section "Object Container Files". If you're into the whole brevity thing, then Wikipedia has a more pithy description:

    An Avro Object Container File consists of:

    • A file header, followed by
    • one or more file data blocks.

    A file header consists of:

    • Four bytes, ASCII 'O', 'b', 'j', followed by the Avro version number which is 1 (0x01) (Binary values 0x4F 0x62 0x6A 0x01).
    • File metadata, including the schema definition.
    • The 16-byte, randomly-generated sync marker for this file.

    For data blocks Avro specifies two serialization encodings, binary and JSON. Most applications will use the binary encoding, as it is smaller and faster. For debugging and web-based applications, the JSON encoding may sometimes be appropriate.
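    To make the layout concrete, here is a minimal sketch in plain Python (no Avro library) that builds a tiny container file in memory and then walks it back. It assumes the block layout given in the Avro specification: each data block is a long object count, a long byte size of the serialized objects, the serialized objects themselves, and then the file's 16-byte sync marker. The helper names (`write_long`, `read_long`, `parse_container`, `build_sample`) are made up for this illustration, and the sample uses the primitive schema "string" with the "null" codec (no compression).

    ```python
    import io
    import os

    MAGIC = b"Obj\x01"  # 'O', 'b', 'j', then the format version 1

    def write_long(n: int) -> bytes:
        """Encode an int as an Avro long: zigzag, then base-128 varint."""
        z = (n << 1) ^ (n >> 63)
        out = bytearray()
        while z & ~0x7F:
            out.append((z & 0x7F) | 0x80)
            z >>= 7
        out.append(z)
        return bytes(out)

    def read_long(buf) -> int:
        """Decode one zigzag varint long from a binary stream."""
        acc, shift = 0, 0
        while True:
            b = buf.read(1)[0]
            acc |= (b & 0x7F) << shift
            if not b & 0x80:
                break
            shift += 7
        return (acc >> 1) ^ -(acc & 1)

    def write_bytes(raw: bytes) -> bytes:
        """Avro bytes/string: a long length followed by the raw bytes."""
        return write_long(len(raw)) + raw

    def read_bytes(buf) -> bytes:
        return buf.read(read_long(buf))

    def build_sample() -> bytes:
        """Hand-assemble a container file holding the strings 'hi' and 'avro'."""
        sync = os.urandom(16)
        meta = {"avro.schema": b'"string"', "avro.codec": b"null"}
        header = MAGIC + write_long(len(meta))
        for key, value in meta.items():
            header += write_bytes(key.encode("utf-8")) + write_bytes(value)
        header += write_long(0) + sync  # a zero count ends the metadata map
        payload = write_bytes(b"hi") + write_bytes(b"avro")
        block = write_long(2) + write_long(len(payload)) + payload + sync
        return header + block

    def parse_container(data: bytes):
        """Return (metadata, sync marker, [(object count, raw payload), ...])."""
        buf = io.BytesIO(data)
        if buf.read(4) != MAGIC:
            raise ValueError("not an Avro object container file")
        meta = {}
        while True:
            count = read_long(buf)
            if count == 0:
                break
            if count < 0:  # a negative count is followed by the block's byte size
                read_long(buf)
                count = -count
            for _ in range(count):
                key = read_bytes(buf).decode("utf-8")
                meta[key] = read_bytes(buf)
        sync = buf.read(16)
        blocks = []
        while buf.read(1):  # any bytes left means another data block
            buf.seek(-1, io.SEEK_CUR)
            count = read_long(buf)
            size = read_long(buf)
            payload = buf.read(size)
            if buf.read(16) != sync:
                raise ValueError("sync marker mismatch")
            blocks.append((count, payload))
        return meta, sync, blocks

    meta, sync, blocks = parse_container(build_sample())
    print(meta["avro.schema"].decode("utf-8"), blocks)
    ```

    Unlike Parquet, there is no footer and no per-column structure here: a reader needs only the header to get the schema, and the repeated sync marker lets it find block boundaries when a file is split.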

    You can verify this against their reference implementation, e.g. in DataFileWriter.java - start with the main create method and then look at the append(D datum) method.

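    The binary object encoding is described in the "Binary Encoding" section of the Avro specification. The encoded data is simply a depth-first traversal of the object (or objects) in schema order, with each object and field encoded as described there. No field names or type tags are written, which is why a reader needs the schema to decode the bytes.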
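    The traversal can be sketched in a few lines of Python. This is an illustrative encoder only, covering a handful of types (boolean, int/long, string, record, array); the function names and the `User` schema are hypothetical, and a real implementation would also handle unions, maps, enums, fixed, floats and so on.

    ```python
    def encode_long(n: int) -> bytes:
        """Avro int/long: zigzag encoding followed by a base-128 varint."""
        z = (n << 1) ^ (n >> 63)
        out = bytearray()
        while z & ~0x7F:
            out.append((z & 0x7F) | 0x80)
            z >>= 7
        out.append(z)
        return bytes(out)

    def encode(schema, datum) -> bytes:
        """Encode datum by walking the schema; only a subset of types is covered."""
        if schema == "boolean":
            return b"\x01" if datum else b"\x00"
        if schema in ("int", "long"):
            return encode_long(datum)
        if schema == "string":
            raw = datum.encode("utf-8")
            return encode_long(len(raw)) + raw
        if isinstance(schema, dict) and schema["type"] == "record":
            # Fields are concatenated in schema order, with no names or separators.
            return b"".join(
                encode(field["type"], datum[field["name"]])
                for field in schema["fields"]
            )
        if isinstance(schema, dict) and schema["type"] == "array":
            # One block holding all items, then a zero count to end the array.
            out = b""
            if datum:
                out += encode_long(len(datum))
                out += b"".join(encode(schema["items"], item) for item in datum)
            return out + encode_long(0)
        raise NotImplementedError(f"type not covered by this sketch: {schema!r}")

    # Hypothetical schema, used only for this example.
    user_schema = {
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"},
            {"name": "tags", "type": {"type": "array", "items": "string"}},
        ],
    }

    encoded = encode(user_schema, {"name": "Al", "age": 27, "tags": ["x"]})
    print(encoded.hex(" "))
    ```

    The resulting bytes are what would sit inside a data block's payload, one encoded record after another.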