We work with a pipeline of kafka/samza jobs using protobuf encoded messages. The pipeline can be quite lengthy for certain data sets and we want to add a timestamp/id for each stage in the pipeline to monitor efficiency and service health.
The additional information would be added to a repeated field in the schema called touchpoints. Obviously decoding the message in java/samza, adding the additional message and serializing again has an overhead which increases with the size of the message (some can be quite large increasing deserialize time), some parts of the pipe are just filters which check the message key and may not even have to deserialize at all so the less overhead on these the better.
Is it possible to just inject a second serialized message into an existing message without deserializing, if so would this be very bad practice to do so (I can only think it would) and is there a better solution to not having to deserialize/add/serialize for monitoring message path/time to flow
In general, this would be quite tricky and can't be done in a "streaming" way for the following reason: Child messages are prefixed with their size encoded in a variable length integer. So injecting something would mean to adjust all the parent sizes recursively up to the root, and the size changes may move the contents again because of the variable length encoding of the size.
One thing you could do to avoid this problem might be to use fixed size fields for the time stamps, and to make sure they are filled with a value when building the protos at the first stage, so you have already allocated the corresponding space in the proto. This should allow you to to scan the proto for the (ideally unique) time stamp field ids using a CodedInputStream
, and to write the patched stream back using a CodedOutputStream
. Getting this correct will still require understanding the internal format. I'd recommend to start with an empty pass-trough "filter" first and to check that the output matches the input (update the question if you run into any issues with that)