
Can Kaitai Struct be used to describe TLV data without creating new types for each field?


I'm reverse engineering a file format that stores each field as TLV blocks (type, length, value).

The fields do not have to be in order, or even present at all. Each field that is present is introduced by a sentinel: a 16-bit type identifier followed by a 32-bit end offset. There are hundreds of unique identifiers, but a decent chunk of those are just single primitive values. Aside from denoting the type, the identifiers also determine which field the data should be stored in.

It is also worth noting that there will never be a duplicate id on a parent structure. The only time a duplicate can occur is when an array/list holds multiple objects of the same type.

I have successfully written a Kaitai definition for one of them:

meta:
  id: struct_02ea
  endian: le

seq:
  - id: unk_00
    type: s4
    
  - id: fields
    type: field_block
    repeat: eos
    
types:
  sentinel:
    seq:
      - id: id
        type: u2
      
      - id: end_offset
        type: u4
  field_block:
    seq:
      - id: sentinel
        type: sentinel
      - id: value
        type:
          switch-on: sentinel.id
          cases:
            0xF0: u1
            0xF1: u1
            0xF2: u1
            0xF3: u1
            0xF4: u4
            0xF5: u4
        size: sentinel.end_offset - _root._io.pos

Handling things this way does work, and I could likely map out the entire format like this. However, when it comes time to compile this definition into another format, things get nasty.

Since I am wrapping each field in a field_block, the generated code stores these values in that type of object. This is incredibly inefficient when half of the generated field_block objects store a single integer. It would also require the consuming code to iterate through a list of each field block in order to get the actual field's value.

Ideally, I would like to define this structure so that the sentinels are only parsed while Kaitai is reading the data, and each value would be mapped to a field on the parent structure.

Is this possible? This technology is really cool, and I'd love to use it in my project, but I feel like the overhead that this is generating is a lot more trouble than it's worth.

Here's an example of the definition when compiled into C#:

using System.Collections.Generic;

namespace Kaitai
{
    public partial class Struct02ea : KaitaiStruct
    {
        public static Struct02ea FromFile(string fileName)
        {
            return new Struct02ea(new KaitaiStream(fileName));
        }

        public Struct02ea(KaitaiStream p__io, KaitaiStruct p__parent = null, Struct02ea p__root = null) : base(p__io)
        {
            m_parent = p__parent;
            m_root = p__root ?? this;
            _read();
        }
        private void _read()
        {
            _unk00 = m_io.ReadS4le();
            _fields = new List<FieldBlock>();
            {
                var i = 0;
                while (!m_io.IsEof) {
                    _fields.Add(new FieldBlock(m_io, this, m_root));
                    i++;
                }
            }
        }
        public partial class Sentinel : KaitaiStruct
        {
            public static Sentinel FromFile(string fileName)
            {
                return new Sentinel(new KaitaiStream(fileName));
            }

            public Sentinel(KaitaiStream p__io, Struct02ea.FieldBlock p__parent = null, Struct02ea p__root = null) : base(p__io)
            {
                m_parent = p__parent;
                m_root = p__root;
                _read();
            }
            private void _read()
            {
                _id = m_io.ReadU2le();
                _endOffset = m_io.ReadU4le();
            }
            private ushort _id;
            private uint _endOffset;
            private Struct02ea m_root;
            private Struct02ea.FieldBlock m_parent;
            public ushort Id { get { return _id; } }
            public uint EndOffset { get { return _endOffset; } }
            public Struct02ea M_Root { get { return m_root; } }
            public Struct02ea.FieldBlock M_Parent { get { return m_parent; } }
        }
        public partial class FieldBlock : KaitaiStruct
        {
            public static FieldBlock FromFile(string fileName)
            {
                return new FieldBlock(new KaitaiStream(fileName));
            }

            public FieldBlock(KaitaiStream p__io, Struct02ea p__parent = null, Struct02ea p__root = null) : base(p__io)
            {
                m_parent = p__parent;
                m_root = p__root;
                _read();
            }
            private void _read()
            {
                _sentinel = new Sentinel(m_io, this, m_root);
                switch (Sentinel.Id) {
                case 243: {
                    _value = m_io.ReadU1();
                    break;
                }
                case 244: {
                    _value = m_io.ReadU4le();
                    break;
                }
                case 245: {
                    _value = m_io.ReadU4le();
                    break;
                }
                case 241: {
                    _value = m_io.ReadU1();
                    break;
                }
                case 240: {
                    _value = m_io.ReadU1();
                    break;
                }
                case 242: {
                    _value = m_io.ReadU1();
                    break;
                }
                default: {
                    _value = m_io.ReadBytes((Sentinel.EndOffset - M_Root.M_Io.Pos));
                    break;
                }
                }
            }
            private Sentinel _sentinel;
            private object _value;
            private Struct02ea m_root;
            private Struct02ea m_parent;
            public Sentinel Sentinel { get { return _sentinel; } }
            public object Value { get { return _value; } }
            public Struct02ea M_Root { get { return m_root; } }
            public Struct02ea M_Parent { get { return m_parent; } }
        }
        private int _unk00;
        private List<FieldBlock> _fields;
        private Struct02ea m_root;
        private KaitaiStruct m_parent;
        public int Unk00 { get { return _unk00; } }
        public List<FieldBlock> Fields { get { return _fields; } }
        public Struct02ea M_Root { get { return m_root; } }
        public KaitaiStruct M_Parent { get { return m_parent; } }
    }
}

Solution

  • Affiliation disclaimer: I'm a Kaitai Struct maintainer (see my GitHub profile).

    Since I am wrapping each field in a field_block, the generated code stores these values in that type of object. This is incredibly inefficient when half of the generated field_block objects store a single integer. It would also require the consuming code to iterate through a list of each field block in order to get the actual field's value.

    I think that rather than trying to describe the entire format with one all-encompassing Kaitai Struct specification, it's better not to let the generated code parse all the fields automatically. Move the parsing control to your application code: use the type Struct02ea.FieldBlock, which represents an individual field, and replicate the "repeat until end of stream" loop from the generated code you posted:

                _fields = new List<FieldBlock>();
                {
                    var i = 0;
                    while (!m_io.IsEof) {
                        _fields.Add(new FieldBlock(m_io, this, m_root));
                        i++;
                    }
                }
    

    The advantage of doing so is that you can adjust the loop to fit your needs. To avoid the overhead you describe, you'll probably want to keep the Struct02ea.FieldBlock object in a local variable inside the loop body, pull only the values you care about (save them in your compact, consumer-friendly output structures) and let it leave the scope after the loop iteration ends. This will allow each original FieldBlock object to get garbage-collected once you process it, so the overhead they have will be limited to a single instance and not multiplied by the number of fields in the file.
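
    A minimal sketch of such an application-side loop, assuming the fields entry is not parsed automatically (so the Struct02ea constructor reads only unk_00) and that the generated classes from the question are used unchanged. CompactRecord, TlvReader, FieldF0 and FieldF4 are hypothetical names invented for illustration; the question does not say what the IDs 0xF0 to 0xF5 mean.

    ```csharp
    using Kaitai;

    // Hypothetical consumer-side type; the member names are placeholders.
    public sealed class CompactRecord
    {
        public int Unk00;
        public byte? FieldF0;   // value of the block with ID 0xF0, if present
        public uint? FieldF4;   // value of the block with ID 0xF4, if present
    }

    public static class TlvReader
    {
        public static CompactRecord Read(string fileName)
        {
            using (var io = new KaitaiStream(fileName))
            {
                // Assumption: `fields` is not parsed by the generated code,
                // so this constructor consumes only unk_00.
                var root = new Struct02ea(io);
                var record = new CompactRecord { Unk00 = root.Unk00 };

                while (!io.IsEof)
                {
                    // Pass `root` as parent and root so that the expression
                    // `_root._io.pos` in the `size` key can be evaluated.
                    var block = new Struct02ea.FieldBlock(io, root, root);

                    switch (block.Sentinel.Id)
                    {
                        case 0xF0:
                            record.FieldF0 = (byte)block.Value;   // parsed as u1
                            break;
                        case 0xF4:
                            record.FieldF4 = (uint)block.Value;   // parsed as u4
                            break;
                        // Other IDs are parsed by FieldBlock and then ignored.
                    }
                    // `block` goes out of scope here and can be collected.
                }
                return record;
            }
        }
    }
    ```

    Only one FieldBlock is alive at a time, and the consuming code reads plain members of CompactRecord instead of iterating over a list of blocks.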

    The most straightforward and seamless way to prevent the Kaitai Struct-generated code from parsing the fields (while otherwise keeping everything the same) is to add if: false in the KSY specification, as @webbnh suggested in a GitHub issue:

    seq:
      - id: unk_00
        type: s4
        
      - id: fields
        type: field_block
        repeat: eos
        if: false  # add this
    

    Adding if: false works better than omitting the fields entry from seq entirely, because the kaitai-struct-compiler occasionally has trouble with unused types (when compiling a KSY spec with unused types, you may get an error "Unable to derive _parent type in ..." due to a compiler bug). With the if: false trick you can't run into this error, because the field_block type is no longer unused.