python parsing schema binaryfiles dfdl

Binary data format schema description and decoding in Python


I'm writing some Python scripts for decoding various binary formats. Each format has many different record types, and much of the data is encoded in specific bit ranges within specific bytes. I'm therefore looking for a Python package that cleanly separates the decoding code from the format specification, so the code doesn't end up too messy. Ideally it would also let me keep different versions of a format side by side. Below is a very rough outline of what I'm looking for.

Example my_data_format.xml:

<format version="1A">
  <record name="My first record">
    <ignore bytes="2" />
    <field name="A simple number" bytes="1" convert_to="int" />
    <field name="A simple float" bytes="4" convert_to="float" />
    <array name="A list of floats" length="3">
      <field bytes="4" convert_to="float" />
    </array>
    <field bytes="2">
      <ignore bits="5" />
      <bitfield name="First bit-field" num_bits="6" convert_to="uint8" />
      <bitfield name="Second bit-field" num_bits="5" convert_to="float" />
    </field>
  </record>
</format>

Example python script my_data_reader.py:

from binary_schema import load_schema

schema = load_schema('my_data_format.xml')

with open('myfile.bin', 'rb') as f:
  decoded_data = schema.read_record_from_stream('My first record', f)

print(decoded_data)

This would produce a dictionary:

{'A simple float': 3.234,
 'A simple number': 3,
 'A list of floats': [1., 2., 3.],
 'First bit-field': 3,
 'Second bit-field': 2.0}

Is there such a thing?

I've looked at a couple of things already:

  • I know things like protocol buffers are useful for specifying records, but as far as I understand, they don't support specifying bit-fields and their interpretation.

  • There's DFDL, which seems to be exactly what I need, but I've only seen a Java client, and it looks like a big, bulky software package (although there's apparently a C version somewhere).

  • My current implementation uses construct, which works nicely, but it feels a bit messier than loading the schema from a file.
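
To make the request concrete, here is a minimal sketch of the idea using only the standard library (xml.etree.ElementTree and struct). It is not an existing package: SCHEMA_XML, read_record and the two helper functions are hypothetical names invented for illustration. It also assumes big-endian byte order, most-significant-bit-first bit-fields and 4-byte floats only, none of which the outline above specifies.

```python
import io
import struct
import xml.etree.ElementTree as ET

# Hypothetical schema modelled on the XML outline in the question.
SCHEMA_XML = """
<format version="1A">
  <record name="My first record">
    <ignore bytes="2" />
    <field name="A simple number" bytes="1" convert_to="int" />
    <field name="A simple float" bytes="4" convert_to="float" />
    <array name="A list of floats" length="3">
      <field bytes="4" convert_to="float" />
    </array>
    <field bytes="2">
      <ignore bits="5" />
      <bitfield name="First bit-field" num_bits="6" convert_to="uint8" />
      <bitfield name="Second bit-field" num_bits="5" convert_to="float" />
    </field>
  </record>
</format>
"""


def _read_scalar(node, stream):
    """Read one <field> without bit-level children (big-endian assumed)."""
    raw = stream.read(int(node.get("bytes")))
    if node.get("convert_to") == "float":
        return struct.unpack(">f", raw)[0]  # only 4-byte floats handled here
    return int.from_bytes(raw, "big")


def _read_bits(node, stream, out):
    """Split a <field> into its <ignore>/<bitfield> children, MSB first."""
    nbytes = int(node.get("bytes"))
    value = int.from_bytes(stream.read(nbytes), "big")
    remaining = nbytes * 8  # bits not yet consumed, counted from the top
    for child in node:
        width = int(child.get("bits") or child.get("num_bits"))
        remaining -= width
        if child.tag == "bitfield":
            bits = (value >> remaining) & ((1 << width) - 1)
            as_float = child.get("convert_to") == "float"
            out[child.get("name")] = float(bits) if as_float else bits


def read_record(schema_root, record_name, stream):
    """Decode one record from a binary stream into a dictionary."""
    record = next(r for r in schema_root.iter("record")
                  if r.get("name") == record_name)
    out = {}
    for node in record:
        if node.tag == "ignore":
            stream.read(int(node.get("bytes")))  # skip padding bytes
        elif node.tag == "array":
            item = node[0]  # the single element description inside <array>
            count = int(node.get("length"))
            out[node.get("name")] = [_read_scalar(item, stream)
                                     for _ in range(count)]
        elif len(node):  # a <field> that contains bit-level children
            _read_bits(node, stream, out)
        else:
            out[node.get("name")] = _read_scalar(node, stream)
    return out


# Build a sample record by hand: 2 padding bytes, one byte, four floats,
# then 16 bits laid out as 5 ignored bits + 6-bit value 3 + 5-bit value 2.
data = (b"\x00\x00" + bytes([3]) + struct.pack(">f", 3.25)
        + struct.pack(">3f", 1.0, 2.0, 3.0) + ((3 << 5) | 2).to_bytes(2, "big"))

decoded = read_record(ET.fromstring(SCHEMA_XML), "My first record",
                      io.BytesIO(data))
print(decoded)
```

The decoder never mentions a field name, so a different format version only needs a different XML file. A real package would additionally need to handle endianness options, nested records, variable-length data and proper error reporting.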


Solution

  • Check out https://kaitai.io/ "Kaitai Struct: A new way to develop parsers for binary structures."

    I think you will find that it does what you need. The schema is not XML (it is written in YAML, as .ksy files), but I think that format is much more flexible than XML anyway.