Search code examples
pythonparsingkaitai-struct

Parse an item depending on previous elements


To give some context, I am parsing a DICOM file, and having difficulties using the Transfer Syntax entry to determine whether to use implicit or explicit parsing. But let me define a simplified syntax, so no dicom knowledge is required.

We have a sequence of entries, each entry has a group number , and a data part. The group number is always represented as u2, but the data can be of different types, lets say u2 or u4. The order of the entries can be arbitrary, except that all the entries with group number == 2 must be at the top. All entries with group number == 2 has a data type of u2, but subsequent data parts can differ.

And here comes the hard part: all the items with group number != 2 have data type u4 if and only if an entry exactly like this was present previously:

(group, data) == (0x0002, 0x0101)

In python for example, I would parse it like this:

def read_entries(stream):
  is_u4 = False
  while not stream.eos():
    group = stream.read_u2()
    if group != 2 and is_u4:
      data = stream.read_u4()
    else:      
      data = stream.read_u2()
    if group == 2 and data == 0x0101:
      is_u4 = True
    yield (group, data)

Is there a way to achieve this using kaitai-struct?


Solution

  • Short answer

    Exact transcription of Python code to KS per se is impossible right now, only by plugging in the code written in imperative language.

    However, thanks to your additional info, an alternative approach is available, see below for solution.

    Longer answer

    Kaitai Struct emphasizes stateless parsing, thus everything we parse is effectively immutable (read-only), i.e. there are variables that can change its value during the parsing. Thus propagating is_u4 between parse cycles is non-trivial thing to do. We have similar problem with MIDI running status, for example.

    Some of the solutions suggested by people sometimes is recursive type definition + propagation of instances using _parent syntax (see issue #70), but:

    • This is currently hindered by lack of type definition for recursively defined instance values + lack of initial value seeding
    • This generates a linked list of values, not an array (i.e. that's not what most people are comfortable with)
    • This takes heavy toll on call stack (i.e. monstrously long structures would most likely fail with stack exhaustion)

    Alternative approach, given additional information you've provided might be viable. Actually, it turns out that whole stream of elements can be grouped in 3 groups:

    • elements before (0x0002, 0x0101) - they all use u2-u2 format
    • the (0x0002, 0x0101) element itself - note that it is also always u2-u2 when it's encountered for the first time
    • elements after (0x0002, 0x0101) - they all use u2-u4 format; even if additional (0x0002, 0x0101) are encountered afterwards, per the algorithm you've provided, it should be read as u2-u4, so it's ok.

    Thus it possible to get away with:

    types:
      elements:
        seq:
          - id: elements_2
            type: element_2
            repeat-until: _.is_u4
          - id: elements_4
            type: element_4
            repeat: eos
      element_2:
        seq:
          - id: group
            type: u2
          - id: data
            type: u2
        instances:
          is_u4:
            value: group == 2 and data == 0x0101
      element_4:
        seq:
          - id: group
            type: u2
          - id: data
            type: u4
    

    Please tell me if I got it right and that will work for your project?