python, c++, parsing, protocol-buffers, binaryfiles

Deserialize Google Protobuf binary file


Google Protobuf has confused me more than ever and I am trying to understand how things work.

  1. Please help me check my understanding. A .proto file defines the structure of a message, and protoc is the compiler. The data is compiled into a binary file (.pb). Is that correct? If not, can you please explain? I find the Google Protobuf docs quite confusing, and I haven't had any luck with Stack Overflow or other blogs.

  2. Importantly, I can't modify the C++ code where the logic is defined. With that said, I'd like to deserialize filename.pb (a binary file) and parse the results in Python. Is this possible?

Thanks for your help in advance!!


Solution

  • I am not sure this fully answers your question, but here is an example. Once you have your Protocol_pb2.py, you can parse messages quite easily with the protobuf Python API (and possibly struct). I am not a protobuf expert, but I have at least parsed Mumble messages. I don't know exactly how or why you are using protobuf, but this example shows how to parse Mumble messages (which use protobuf) and will hopefully give you some insight. So:

    Import the protoc-generated module (created as described in my answer to your previous question) and struct:

    import struct
    import Mumble_pb2 as mumble_protobuf
    

    The mumble_protobuf module contains the messages defined in your .proto file as Python classes. You can store the message types in, for example, a dict:

    PACKET_TYPES = {
        0: mumble_protobuf.Version,
        1: mumble_protobuf.UDPTunnel,
        2: mumble_protobuf.Authenticate,
        # ... (remaining message types omitted)
    }
    

    I'm skipping some details here, but once you receive the binary data of a packet, you can parse it. Each application probably does this differently; Mumble, for example, prefixes each protobuf message with 2 bytes holding the message type and 4 bytes holding the packet length. Your application probably does it differently. Whatever the case, you must somehow know which kind of message you are parsing (assuming your protocol has multiple message types). As an example:

    # The header format matches the Mumble packet layout: prefix + protobuf message
    HEADER_FORMAT = ">HI" # Big endian, unsigned short + unsigned int
    packet_type, packet_length = struct.unpack_from(HEADER_FORMAT, buffer)
    
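To make the header layout concrete, here is a minimal sketch of walking a buffer that holds several packets framed in the ">HI" style described above. The helper name split_packets and the demo payloads are made up for illustration (the payloads are placeholder bytes, not real protobuf data), and your application's framing may well differ.

```python
import struct

HEADER_FORMAT = ">HI"  # big-endian: unsigned short (type) + unsigned int (length)
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 2 + 4 bytes


def split_packets(buffer):
    """Yield (packet_type, packet_data) for each complete packet in buffer.

    Hypothetical helper: assumes every packet is a header followed by
    exactly packet_length bytes of payload.
    """
    offset = 0
    while offset + HEADER_SIZE <= len(buffer):
        packet_type, packet_length = struct.unpack_from(HEADER_FORMAT, buffer, offset)
        start = offset + HEADER_SIZE
        end = start + packet_length
        if end > len(buffer):
            break  # incomplete trailing packet; more data is needed
        yield packet_type, bytes(buffer[start:end])
        offset = end


# Two fake packets with placeholder payloads (not real protobuf messages)
demo = (struct.pack(HEADER_FORMAT, 0, 3) + b"abc"
        + struct.pack(HEADER_FORMAT, 2, 2) + b"xy")
packets = list(split_packets(demo))
```

Each packet_data yielded this way is what would be handed to ParseFromString once the matching message class is known.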

    If you have multiple messages in a buffer, you need to extract each packet's data from the buffer based on the packet length. Once you know the message type, you can parse it. The following fetches the class corresponding to the message type from the pb2 module and parses the message with it.

    MessageClass = PACKET_TYPES[packet_type]
    message = MessageClass()
    message.ParseFromString(packet_data)
    

    Now message finally contains the parsed protobuf message, and you can use it as you would when constructing one: you can access its fields, for example message.UserName or whatever your .proto defines.
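
For the file case in the question, the same ParseFromString call can be applied to the contents of filename.pb. This is only a sketch under assumptions: it presumes the C++ code wrote exactly one serialized message to the file with no extra prefix, and load_message is a hypothetical helper name. If the file holds several messages or a custom header, some framing step would be needed first.

```python
def load_message(path, message_class):
    """Read a whole file and parse it as one serialized protobuf message.

    Hypothetical helper: message_class is expected to be a class from the
    protoc-generated *_pb2 module (anything with ParseFromString works).
    """
    with open(path, "rb") as f:  # binary mode: protobuf data is not text
        data = f.read()
    message = message_class()
    message.ParseFromString(data)
    return message


# Example use, assuming a generated module and message name (both hypothetical):
#   import Protocol_pb2
#   msg = load_message("filename.pb", Protocol_pb2.SomeMessage)
```

The message class has to match the type the C++ side serialized, because the binary data does not carry its own type name.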