I have not worked with a custom file format before, but the project I am working on requires an entirely new, custom binary file format. I don't know the best practices for designing one (such as using identification bytes, a.k.a. "magic numbers") or how to implement them in Python. Here are the basic requirements:
Whenever such a file is provided, I have to read the metadata and get back my original Python dictionary, and then retrieve the body (i.e., the random bytes) for decryption. Please provide a basic implementation, or at least an idea, for reading and writing such a file in Python, along with best practices for creating a custom file format.
Currently, as a temporary solution, I am serializing the dictionary using ormsgpack and prepending it to the output file, then using a custom delimiter, b"\xFF\xFF\xFF\xFF", to separate the serialized metadata from the main body.
|‾‾‾‾‾‾‾‾‾‾‾|
| metadata |
|___________|
| |
| delimiter |
|___________|
| |
| body |
|___________|
However, this is fragile: if the delimiter sequence happens to occur somewhere inside the serialized metadata, the reader will split at the wrong place, read only part of the metadata, and fail.
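To make the failure concrete, here is a minimal sketch. The metadata bytes are hard-coded rather than produced by ormsgpack so that the snippet needs no third-party package; they are the MessagePack encoding of the list [-1, -1, -1, -1], where each -1 is the single byte 0xFF. The variable names are only illustrative.

```python
DELIMITER = b"\xff\xff\xff\xff"

# MessagePack encoding of [-1, -1, -1, -1]: a 4-element array marker (0x94)
# followed by four negative fixints, each encoded as the single byte 0xFF.
metadata = b"\x94\xff\xff\xff\xff"
body = b"encrypted bytes"

blob = metadata + DELIMITER + body

# Naive reader: split at the first occurrence of the delimiter.
parsed_meta, _, parsed_body = blob.partition(DELIMITER)

print(parsed_meta == metadata)  # False: the metadata was cut off after 0x94
print(parsed_body == body)      # False: the real delimiter leaked into the body
```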
Using msgpack is a good idea.
Right after serializing, check the length of the output, and prepend it:
|‾‾‾‾‾‾‾‾‾‾‾|
| length |
| (8 bytes) |
| |
|‾‾‾‾‾‾‾‾‾‾‾|
| metadata |
|___________|
| |
| body |
|___________|
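That's the way many binary protocols and file formats frame variable-length data. Decoding is then easier: read the 8-byte length, read exactly that many bytes of metadata, and treat everything after that as the body. No byte sequence has to be reserved as a delimiter.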
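The layout above could be sketched roughly as follows. This is only one possible implementation: the function names are made up for illustration, and the little-endian byte order of the length field is an assumption (the diagram only says 8 bytes). The functions take the metadata as already-serialized bytes, so you would pass in the result of ormsgpack.packb(your_dict) when writing and call ormsgpack.unpackb() on the returned metadata when reading.

```python
import io
import struct

# 8-byte unsigned length; little-endian is an assumption, pick one and document it.
LENGTH = struct.Struct("<Q")


def write_container(fp, metadata, body):
    """Write the 8-byte metadata length, then the metadata, then the body."""
    fp.write(LENGTH.pack(len(metadata)))
    fp.write(metadata)
    fp.write(body)


def read_container(fp):
    """Return (metadata_bytes, body_bytes) from a file written by write_container."""
    header = fp.read(LENGTH.size)
    if len(header) != LENGTH.size:
        raise ValueError("file too short to contain the length prefix")
    (meta_len,) = LENGTH.unpack(header)

    metadata = fp.read(meta_len)
    if len(metadata) != meta_len:
        raise ValueError("metadata is truncated")

    body = fp.read()  # everything that remains is the body
    return metadata, body


# Round trip through an in-memory file; a file opened with "wb"/"rb" works the same.
# The sample metadata deliberately contains the old delimiter bytes.
sample_meta = b"\x94\xff\xff\xff\xff"
sample_body = b"\xff\xff\xff\xff random encrypted bytes"

buffer = io.BytesIO()
write_container(buffer, sample_meta, sample_body)
buffer.seek(0)
meta_out, body_out = read_container(buffer)
```

Because the reader never searches for a marker, the metadata and the body may contain any bytes at all, including b"\xFF\xFF\xFF\xFF".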
If the metadata is potentially too big for memory, you can write a placeholder length of 0, stream the metadata to the file, and then seek back and overwrite the placeholder with the actual length.
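A rough sketch of that seek-back approach, assuming the metadata arrives as an iterable of byte chunks and the output file is seekable (a regular file is; a pipe or socket is not). The function name and the chunked interface are hypothetical, and the little-endian byte order is again an assumption.

```python
import io
import struct

LENGTH = struct.Struct("<Q")  # 8-byte unsigned length, little-endian (assumed)


def write_with_backpatched_length(fp, metadata_chunks, body_chunks):
    """Stream metadata and body to a seekable file, filling in the length afterwards."""
    length_pos = fp.tell()
    fp.write(LENGTH.pack(0))  # placeholder, overwritten below

    meta_len = 0
    for chunk in metadata_chunks:
        fp.write(chunk)
        meta_len += len(chunk)

    end_of_metadata = fp.tell()
    fp.seek(length_pos)
    fp.write(LENGTH.pack(meta_len))  # overwrite the placeholder with the real length
    fp.seek(end_of_metadata)         # return to where the body starts

    for chunk in body_chunks:
        fp.write(chunk)
    return meta_len


out = io.BytesIO()
written = write_with_backpatched_length(out, [b"he", b"llo"], [b"bo", b"dy"])
```

The resulting file has exactly the same layout as one written with the length known up front, so the reader does not need to change.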
A different option would be to escape the delimiter sequence, but that's more complex and won't be as useful in this scenario.