I am trying to parse a raw binary file with known headers and lengths in Python.
The data is 6-channel multiplexed video.
Frames are delimited in the file as follows:
Beginning of image data
E0 01 00 00 D2 59 80 C1 27 3F EC BB 31 7B 3F EC
BB 31 7B 0F 9B 90 5D A8 81 AA 5F A9 C1 D2 4B B9
9D 0A 8D 1B 8F 89 44 FF 4E 86 92 AD 00 90 5B A8
End of image data
67 49 0B B5 BC 82 38 AE 5E 46 49 86 6A FF 24 97
69 8C 6F 17 6D 67 B5 11 C7 E5 FB E3 3F 65 1F 22
5C F3 7C D0 7C 49 2F CD 26 37 4D 40 FF FF FF FF
The source files are several GB large. What is the best way to parse each channel into a separate file? Also, how can I batch process several files at once, saving the files according to the input name?
Parsing tiny chunks of multi-GB binary files is probably not something Python is going to be very fast at, as it will require a ton of function calls and object creation, meaning a lot of RAM and CPU overhead. If you need more performance or control over memory management, it's probably best to do this in a lower-level language (C, C++, Go, Rust).
However, you can do this kind of thing in Python using the struct module, something like this:
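import struct

header = struct.Struct('>BBBH')  # big-endian: three 1-byte fields, then a 2-byte length
data = b'\xE0\x01\x00\x00\xD2\x59\x80...' # read this from input file
view = memoryview(data)  # slicing a memoryview does not copy the bytes
offset = 0
while offset + header.size <= len(data):
    channel, _, _, length = header.unpack_from(view, offset)
    payload_start = offset + header.size
    write_output(channel, view[payload_start:payload_start + length])
    offset = payload_start + length  # skip the header as well as the payload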
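For multi-GB inputs, one option is to memory-map the file instead of reading it all into memory. The following is only a sketch under assumptions: it reuses the guessed '>BBBH' header layout from above (channel, two ignored bytes, 2-byte payload length), and the function name split_channels and the `<input stem>_ch<N>.bin` naming scheme are made up for illustration.

```python
import mmap
import struct
from pathlib import Path

# Assumed header layout (a guess, adjust to the real format):
# channel, two ignored bytes, big-endian 2-byte payload length.
HEADER = struct.Struct('>BBBH')


def split_channels(in_path, out_dir):
    """Append each record's payload to <out_dir>/<input stem>_ch<N>.bin."""
    in_path, out_dir = Path(in_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = {}  # channel number -> open output file
    try:
        with open(in_path, 'rb') as f:
            # Map the file read-only; the OS pages data in on demand.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                offset = 0
                while offset + HEADER.size <= len(mm):
                    channel, _, _, length = HEADER.unpack_from(mm, offset)
                    start = offset + HEADER.size
                    if channel not in outputs:
                        name = f'{in_path.stem}_ch{channel}.bin'
                        outputs[channel] = open(out_dir / name, 'wb')
                    # Slicing the map copies only this one payload.
                    outputs[channel].write(mm[start:start + length])
                    offset = start + length
    finally:
        for out in outputs.values():
            out.close()
    return sorted(outputs)  # channel numbers that were seen
```

Calling `split_channels('capture.raw', 'out')` would then leave one file per channel in `out`, each named after the input file.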
Things to note:
- Byte order is selected by < vs > in the format string.
- write_output() above is not defined here; it stands for whatever code writes a channel's data to its output file.

As far as batch processing several files at once, your best bet there is the multiprocessing module. It does take a while to get your head around, but after that it's pretty simple to use.
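As a rough illustration of the multiprocessing approach, the sketch below hands each input file to a worker process and derives the output name from the input name. The names process_file and process_all and the `_parsed.bin` suffix are hypothetical, and the worker body is a stand-in that just copies the file; the real parsing would go there.

```python
import multiprocessing
import shutil
import sys
from pathlib import Path


def process_file(in_path):
    """Hypothetical per-file worker: writes <input stem>_parsed.bin next to the input."""
    in_path = Path(in_path)
    out_path = in_path.with_name(in_path.stem + '_parsed.bin')
    # Placeholder work: copy the input unchanged. Replace with the real parser.
    shutil.copyfile(in_path, out_path)
    return str(out_path)


def process_all(paths, workers=None):
    """Run process_file on every path using a pool of worker processes."""
    # workers=None lets Pool default to the number of CPUs.
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(process_file, [str(p) for p in paths])


if __name__ == '__main__':
    # The guard is required on platforms that start workers by re-importing this module.
    if len(sys.argv) > 1:
        for out in process_all(sys.argv[1:]):
            print(out)
```

Because each file is handled by a separate process, the files are parsed in parallel without the workers sharing any state; for several multi-GB files, disk throughput may end up being the limit rather than CPU.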