I have a simple (and maybe silly) question about binary data files. If a simple type is used (int, float, ...), it is easy to imagine the structure of the binary file (a sequence of floats, each written using a fixed number of bytes). But what about structures, objects and functions? Is there some kind of convention in each language regarding the order in which variable names, attributes and methods are written, and if so, can this order be changed and customized? Otherwise, is there some kind of header that describes the format used in each file?
I'm mostly interested in Python and C/C++. When I use a pickled (or gzipped) file, for example, Python "knows" whether the original object has a certain method or attribute without me casting the unpickled object or indicating its type, and I've always wondered how that is implemented. I didn't know how to look this up on Google because it may have something to do with how these languages are designed in the first place. Any pointers would be much appreciated.
It's called serialization, because it is about serializing your in-memory data structures into a linear stream of bytes, such as a file.
The basic algorithm is something like "iterate over all keys and values in a dict (or over all items in a list) and write them into a file". But you have to specify a format first: if you store a string, how do you know where it ends? You have to either store its length first or use some kind of end-of-string mark (like the closing " in JSON).
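To make the length-prefix idea concrete, here is a minimal sketch in Python. The format is made up for illustration (it is not msgpack, pickle or any real format): a 4-byte entry count, then for each entry a 4-byte key length, the UTF-8 key bytes and an 8-byte float. The function names are hypothetical.

```python
import struct


def write_string(buf: bytearray, s: str) -> None:
    """Append a string as: 4-byte little-endian length, then UTF-8 bytes."""
    data = s.encode("utf-8")
    buf += struct.pack("<I", len(data))
    buf += data


def read_string(data: bytes, offset: int):
    """Read a length-prefixed string; return it and the offset just past it."""
    (length,) = struct.unpack_from("<I", data, offset)
    offset += 4
    s = data[offset:offset + length].decode("utf-8")
    return s, offset + length


def serialize(record: dict) -> bytes:
    """Serialize a dict of str -> float in the made-up format described above."""
    buf = bytearray()
    buf += struct.pack("<I", len(record))      # number of entries
    for key, value in record.items():
        write_string(buf, key)                 # the length prefix tells the reader where the key ends
        buf += struct.pack("<d", value)        # fixed-size 8-byte float needs no prefix
    return bytes(buf)


def deserialize(data: bytes) -> dict:
    """Inverse of serialize()."""
    (count,) = struct.unpack_from("<I", data, 0)
    offset = 4
    record = {}
    for _ in range(count):
        key, offset = read_string(data, offset)
        (value,) = struct.unpack_from("<d", data, offset)
        offset += 8
        record[key] = value
    return record
```

Note that nothing in the bytes says "this is a dict of floats": the reader has to know the format in advance. Self-describing formats put type tags in front of each value instead.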
Some serialization formats that are widely used for custom data are JSON, YAML, XML, MessagePack, Google Protocol Buffers...
For an idea of how this works, have a look at, for example, the msgpack spec or the Cap'n Proto encoding spec (Cap'n Proto is another serialization format, a rather low-level one).
For Python's pickle there is PEP 3154, which specifies protocol 4, and of course the source code of the pickle module.
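As for how Python "knows" the methods of an unpickled object: as far as I can tell, the pickle stores the module and class name plus the data needed to rebuild the instance, not the methods themselves. On load, pickle imports the module and looks the class up by name. A small sketch using the standard-library `fractions.Fraction` as a stand-in for your own class (the same applies to user-defined classes, which must be importable when you unpickle):

```python
import pickle
import pickletools
from fractions import Fraction

# Pickle an instance using protocol 4 (the one specified in PEP 3154).
payload = pickle.dumps(Fraction(3, 4), protocol=4)

# The byte stream names the module and the class...
has_module_name = b"fractions" in payload
has_class_name = b"Fraction" in payload
# ...but the method names (and their code) are not in it.
has_method_name = b"limit_denominator" in payload

# Loading imports the module, looks up the class, and rebuilds the instance,
# so the methods come from the class definition, not from the file.
restored = pickle.loads(payload)

# Print a human-readable listing of the opcodes in the pickle.
pickletools.dis(payload)
```

The `pickletools.dis` listing is a quick way to see what is actually in a pickle file without reading the spec first.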