Protobuf embedded message leads to extra bytes; is this a delimiter?


I was trying the protobuf sample code for Python. Here is my pytest.proto:

message Person{
    required string name=1;
    required int32 id=2;
    optional string email=3;

    enum PhoneType{
        mobile=0;
        home=1;
        work=2;
    }
    message PhoneNumber{
        required string number=1;
        optional PhoneType type=2[default=home];
    }
    repeated PhoneNumber phone=4;
}

Compile it:

protoc pytest.proto --python_out=./

Then my python file:

import pytest_pb2

person = pytest_pb2.Person()
person.name = "bbb"
person.id = 9

phone_number = person.phone.add()
phone_number.number = "aaa"
phone_number.type = pytest_pb2.Person.work

# SerializeToString() returns raw bytes, so open the file in binary mode.
with open("log4py.data", "wb") as f:
    f.write(person.SerializeToString())

Run it:

$python pytest.py && xxd log4py.data
00000000: 0a03 6262 6210 0922 070a 0361 6161 1002  ..bbb.."...aaa..
          name="bbb"  id=9  ???  number="aaa" type=work

From the above I can see:

0a03 6262 62 --> name="bbb"
1009         --> id=9
22 07        --> What's this?
0a03 6161 61 --> number="aaa"
1002         --> type=work

I don't understand what the extra bytes "22 07" mean here. They seem to indicate that there's an embedded structure. So I changed my Python program to add two PhoneNumber instances, as below:

phone_number1=person.phone.add()
phone_number1.number="aaa"
phone_number1.type=pytest_pb2.Person.work
phone_number2=person.phone.add()
phone_number2.number="ccc"
phone_number2.type=pytest_pb2.Person.work

Run it and I got:

$python pytest.py && xxd log4py.data
00000000: 0a03 6262 6210 0922 070a 0361 6161 1002  ..bbb.."...aaa..
00000010: 2207 0a03 6363 6310 02                   "...ccc..

This time I see "22 07" twice, once before each PhoneNumber instance. I knew that Protobuf doesn't encode any delimiter bytes, but here "22 07" seems to be a delimiter. Any explanations?


Solution

  • The bytes are the tag and length of the sub-message.

    22 (hex) is the tag; in binary it is 00100 010. The bottom three bits (010 = 2) are the wire type, and wire type 2 means the field value that follows is length-delimited. The upper five bits (00100 = 4) are the field number, so this is field number 4, which is the phone field.

    07 is the length. The sub-message is 7 bytes long.

    "I knew that Protobuf doesn't encode any delimiter bytes"

    Not true: Sub-messages have to be delimited somehow. Protobuf prefers delimiting using a length prefix rather than a special end tag because it lets you skip over the field without decoding every byte.
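
To make the tag and length arithmetic concrete, here is a minimal sketch in Python 3 that walks the 25 bytes from the second hex dump by hand. The helper names read_varint and walk_fields are made up for this illustration and are not part of the protobuf library; the sketch only handles the two wire types that appear in this message (0 = varint, 2 = length-delimited).

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low 7 bits carry data
        if not (byte & 0x80):              # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def walk_fields(buf):
    """Return (field_number, wire_type, value) for each field at one message level."""
    fields = []
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:                 # varint (int32, enum, ...)
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:               # length-delimited (string, bytes, sub-message)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]  # the payload can be skipped without decoding it
            pos += length
        else:
            raise ValueError("wire type %d is not handled in this sketch" % wire_type)
        fields.append((field_number, wire_type, value))
    return fields


# The bytes shown by xxd for the Person with two PhoneNumber entries.
data = bytes.fromhex("0a03626262100922070a03616161100222070a036363631002")

top = walk_fields(data)
# Field 4 is the repeated PhoneNumber; each payload is itself an encoded message.
phones = [walk_fields(value) for (number, wire_type, value) in top if number == 4]

for entry in top:
    print(entry)
for phone in phones:
    print(phone)
```

Note that walk_fields never looks inside a field 4 payload at the top level: the length prefix alone says how far to jump. That is the practical benefit of a length prefix over an end marker.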