Tags: python, parsing, multiplexing

Parse raw binary file for multiple sets of video frames


I am trying to parse a raw binary file in Python with known headers and lengths.

The data is a 6 channel multiplexed video.

The file uses the following rules to separate frames:

  • Byte 1: the channel number (e.g. 0xE0, 0xE1, 0xE2, ...)
  • Bytes 4 & 5: the length of the image data
  • Byte 6 onward: the image data itself, running for the stated length
  • The end of the image data is padded with 0xFF so that the next image chunk starts at the beginning of a 16-byte row.

Beginning of image data:

    E0 01 00 00 D2 59 80 C1 27 3F EC BB 31 7B 3F EC
    BB 31 7B 0F 9B 90 5D A8 81 AA 5F A9 C1 D2 4B B9
    9D 0A 8D 1B 8F 89 44 FF 4E 86 92 AD 00 90 5B A8

End of image data:

    67 49 0B B5 BC 82 38 AE 5E 46 49 86 6A FF 24 97
    69 8C 6F 17 6D 67 B5 11 C7 E5 FB E3 3F 65 1F 22
    5C F3 7C D0 7C 49 2F CD 26 37 4D 40 FF FF FF FF

The source files are several GB in size. What is the best way to parse each channel into a separate file? And how can I batch-process several files at once, saving the output files according to the input name?


Solution

  • Parsing tiny chunks of multi-GB binary files is probably not something Python is going to be very fast at, as it will require a ton of function calls and object creation, meaning a lot of RAM and CPU overhead. If you need more performance or control over memory management, it's probably best to do this in a lower-level language (C, C++, Go, Rust).

    However, you can do this kind of thing in Python using the struct module, something like this:

    import struct

    header = struct.Struct('>BBBH')  # channel, 2 unknown bytes, 16-bit length
    data = b'\xE0\x01\x00\x00\xD2\x59\x80...'  # read this from input file
    view = memoryview(data)
    offset = 0
    while offset < len(data):
        channel, _, _, length = header.unpack(view[offset:offset + header.size])
        start = offset + header.size
        write_output(channel, view[start:start + length])
        # skip past the image data and the 0xFF padding to the next 16-byte row
        offset = (start + length + 15) & ~15
    

    Things to note:

    • Determine whether the length is big-endian or little-endian (> vs < in the format string -- the example above assumes big-endian)
    • Using the memoryview is a way to avoid some of the extra object copying and creation -- hopefully it makes this more efficient
    • You'll want to keep the output file(s) open -- I've just hidden this behind write_output() above; see the sketch after this list for one way to manage the handles
    • If the input is multi-GB, you probably want to read the input file in chunks of 1MB or something sensible, rather than all at once -- the sketch below does that as well
    • Be careful about bytes vs strings (different handling on Python 2.x vs 3.x)
    • If you need to know more about opening, reading, and writing files, feel free to post more specific questions
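
    To make those last two points concrete, here is a minimal sketch that combines chunked reading with per-channel output files kept open in a dict. The chunk size, the output naming scheme, and the demux() name are my own assumptions for illustration, not something from the original format spec:

    import struct

    HEADER = struct.Struct('>BBBH')  # channel, 2 unknown bytes, 16-bit length
    CHUNK_SIZE = 1024 * 1024         # read the input 1MB at a time

    def demux(input_path):
        outputs = {}  # channel number -> open output file
        buffer = b''
        with open(input_path, 'rb') as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                buffer += chunk
                offset = 0
                # consume complete records only; a record split across reads
                # is finished on the next pass
                while offset + HEADER.size <= len(buffer):
                    channel, _, _, length = HEADER.unpack_from(buffer, offset)
                    end = offset + HEADER.size + length
                    next_record = (end + 15) & ~15  # 0xFF padding to 16-byte rows
                    if next_record > len(buffer):
                        break
                    if channel not in outputs:
                        # assumed naming: input name plus a per-channel suffix
                        outputs[channel] = open('%s.ch%02X' % (input_path, channel), 'wb')
                    outputs[channel].write(buffer[offset + HEADER.size:end])
                    offset = next_record
                buffer = buffer[offset:]
        for out in outputs.values():
            out.close()

    Re-slicing buffer copies the unconsumed tail, but that tail is at most one partial record, so the cost is small next to the I/O itself.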

    As far as batch processing several files at once, your best bet there is the multiprocessing module. It does take a while to get your head around, but after that it's pretty simple to use.
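
    For example, a minimal sketch using a process pool, assuming the demux() function from the earlier sketch (or whatever per-file parser you end up with) and an assumed *.raw naming pattern for the inputs:

    import glob
    import multiprocessing

    if __name__ == '__main__':
        files = glob.glob('*.raw')  # assumed input naming -- adjust to your files
        with multiprocessing.Pool() as pool:  # one worker per CPU core by default
            pool.map(demux, files)

    Since demux() derives its output names from the input path, the per-file outputs won't collide when several files run in parallel.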