python logging protocol-buffers proto nanopb

Deserializing a Streamed Protocol Buffer Message With Header and Repeated Fields

I am working on deserializing a log file that has been serialized in C using protocol buffers (and NanoPB).

The log file has a short header composed of an entity, a version, and an identifier. After the header, the stream of data should be continuous: it should log the fields from the sensors but not the header values (those should occur only once, at the beginning). The same .proto file was used to serialize the file; I do not have separate .proto files for the header and for the streamed data.

After my implementation, I assume it should look like this:

firmware "1.0.0"
GUID "1231214211321" (example)
Timestamp 123123
Sens1 2343
Sens2 13123
Sens3 13443
Sens4 1231
Sens5 190
Timestamp 123124
Sens1 2345
Sens2 2312
...

I posted this question to figure out how to structure the .proto file initially, when I was implementing the serialization in C. In the end I used a similar approach but did not include: [(nanopb).max_count = 1];

Finally, I opted for the following .proto in Python (there can be more than 5 sensors):

syntax = "proto3";

import "timestamp.proto";

message SessionLogs {
  int32 Entity = 1;
  string Version = 2;
  string GUID = 3;
  repeated SessionLogsDetail LogDetail = 4;
}

message SessionLogsDetail {
  int32 DataTimestamp = 1;  // internal counter to identify the order of session logs
  // Sensor data, there can be X amount of sensors.
  int32 sens1 = 2;
  int32 sens2 = 3;
  int32 sens3 = 4;
  int32 sens4 = 5;
}

At this point, I can serialize a message as I log with my device, and judging by the file size the logging seems to work, but I have not been able to deserialize it offline in Python to check whether my implementation is correct. I can't do it in C, since it's an embedded application, and I want to do the post-processing offline in Python.

Also, I have checked this online protobuf deserializer, where I can pass the serialized file and get it deserialized without needing the .proto file. In it I can see the header values (field 3 is empty, so it's not shown) and the logged information. This makes me think that the serialization is correct but that I am deserializing it incorrectly in Python.

This is my current code used to deserialize the message in Python:

import PSessionLogs_pb2

# Read the whole log file and try to parse it as one message
with open('$PROTOBUF_LOG_FILENAME$', 'rb') as f:
    read_metric = PSessionLogs_pb2.PSessionLogs()
    read_metric.ParseFromString(f.read())

Besides this, I've used protoc to generate the .py equivalent of the .proto file to deserialize offline.

Solution

It looks like you've serialized a header and then serialized some other data immediately afterwards. In other words, instead of serializing a SessionLogs that contains some SessionLogsDetail records, you've serialized a SessionLogs and then, separately, a SessionLogsDetail. Does that sound about right? If so: yes, that will not work correctly. There are ways to do what you're after, but it isn't quite as simple as serializing one message after the other, because the root protobuf object is never terminated: nothing in the stream marks where one message ends and the next begins. What actually happens is that the later fields overwrite the root object's fields that have the same field number.
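
To make the overwriting concrete, here is a minimal pure-Python sketch of the wire format. It is not the protobuf library: the helper names are made up for illustration, it only handles varint fields, and it ignores the schema (a real parser also checks each field's wire type against the .proto). The values are invented stand-ins for a header with Entity = 7 followed by one detail row.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_position) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def varint_field(number: int, value: int) -> bytes:
    """Encode one varint field: tag (field number, wire type 0) then value."""
    return encode_varint(number << 3 | 0) + encode_varint(value)


def read_varint_fields(buf: bytes) -> list[tuple[int, int]]:
    """Decode a buffer holding only varint fields into (number, value) pairs."""
    fields = []
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        assert tag & 0x7 == 0, "this sketch only handles varint fields"
        value, pos = decode_varint(buf, pos)
        fields.append((tag >> 3, value))
    return fields


# Hypothetical data: header (Entity = 7), then a row
# (DataTimestamp = 123123, sens1 = 2343), written back to back.
header = varint_field(1, 7)
row = varint_field(1, 123123) + varint_field(2, 2343)
stream = header + row

fields = read_varint_fields(stream)
merged = dict(fields)  # a later field number replaces an earlier one

print(fields)
print(merged)
```

The decoder sees one flat run of fields with no boundary between the header and the row, so field 1 appears twice and the row's DataTimestamp ends up where Entity was.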

There are two ways of addressing this, depending on the data volume. If the size (including all of the detail rows) is small, you can just change the code so that it is a true parent/child relationship, i.e. so that the rows are all inside the parent. When writing the data, this does not mean that you need to have all the rows before you start writing: there are ways of appending child rows so that you send data as it becomes available. However, when deserializing, it will want to load everything in one go, so this approach is only useful if you're OK with that, i.e. you don't have obscene, open-ended numbers of rows.
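
One way to append child rows as they arrive, sketched in pure Python under some assumptions: each entry of a repeated message field is written on the wire as the field's tag, a varint length, and the child's bytes, so rows can be added to the end of the stream one at a time. The helper names and byte strings below are hypothetical stand-ins; in practice the header and rows would come from your serializer, and field number 4 is taken from LogDetail in the .proto above.

```python
import io


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


# Tag for field 4 (LogDetail) with wire type 2 (length-delimited).
LOG_DETAIL_TAG = encode_varint(4 << 3 | 2)


def append_row(stream, row_bytes: bytes) -> None:
    """Append one already-serialized SessionLogsDetail as a LogDetail entry."""
    stream.write(LOG_DETAIL_TAG)
    stream.write(encode_varint(len(row_bytes)))
    stream.write(row_bytes)


# Hand-encoded stand-ins for serialized messages (assumed values):
header_bytes = b"\x08\x07"   # SessionLogs with only Entity = 7
row0 = b"\x08\x01\x10\x0a"   # DataTimestamp = 1, sens1 = 10
row1 = b"\x08\x02\x10\x0b"   # DataTimestamp = 2, sens1 = 11

log = io.BytesIO()
log.write(header_bytes)      # header fields, written once
append_row(log, row0)        # rows appended as they become available
append_row(log, row1)
data = log.getvalue()
```

A stream written this way should read back as a single SessionLogs whose LogDetail holds both rows, which is also why the reader has to load the whole file in one go.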

If you have large numbers of rows, you'll essentially need to add your own framing. This is often done by adding a length prefix before each payload, so that you can read a single message at a time. Some of the libraries include helper methods for this; for example, in the Java API these are parseDelimitedFrom and writeDelimitedTo. However, my understanding is that the Python API does not currently support this utility, so you'd need to do the framing yourself :(

To summarize, you currently have:

{header - SessionLogs}
{row 0 - SessionLogsDetail}
{row 1 - SessionLogsDetail}

Option 1 is:

{header - SessionLogs
  {row 0 - SessionLogsDetail}
  {row 1 - SessionLogsDetail}
}

Option 2 is:

{length prefix of header}
{header - SessionLogs}
{length prefix of row0}
{row 0 - SessionLogsDetail}
{length prefix of row1}
{row 1 - SessionLogsDetail}

(where the length prefix is something simple like a raw varint, or just a 4-byte integer in some agreed endianness)
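
A rough sketch of option 2 with a raw varint prefix, assuming the writer emits the length of each payload before the payload itself. The function names are made up for illustration, and the payloads here are placeholder bytes rather than real serialized messages. The writer is shown in Python only so the example is self-contained; on the device the same framing would be done in C.

```python
import io
from typing import BinaryIO, Iterator


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def write_frame(stream: BinaryIO, payload: bytes) -> None:
    """Write one payload preceded by its length as a varint."""
    stream.write(encode_varint(len(payload)))
    stream.write(payload)


def read_frames(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each length-prefixed payload until the stream ends cleanly."""
    while True:
        length = 0
        shift = 0
        while True:
            byte = stream.read(1)
            if not byte:
                if shift == 0:
                    return  # end of stream on a frame boundary
                raise EOFError("stream ended inside a length prefix")
            length |= (byte[0] & 0x7F) << shift
            if not byte[0] & 0x80:
                break
            shift += 7
        payload = stream.read(length)
        if len(payload) != length:
            raise EOFError("stream ended inside a payload")
        yield payload


# Placeholder payloads standing in for serialized messages.
buf = io.BytesIO()
write_frame(buf, b"header-bytes")
write_frame(buf, b"row-0")
write_frame(buf, b"x" * 300)  # long enough to need a two-byte prefix

buf.seek(0)
frames = list(read_frames(buf))
```

With real data, the first frame would be passed to ParseFromString on a SessionLogs instance and every later frame to ParseFromString on a fresh SessionLogsDetail, so only one row is in memory at a time.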