How to encode string in protobuf using UnknownFieldSet


I am writing a raw protobuf message with the com.google.protobuf library, using UnknownFieldSet, and I run into a problem when encoding strings: some of them break the decoded result.

I want to encode:

1 -> ["stuff", "stuff"]
2 -> ["stuff","android.microphone","stuff"]

which I figured can be done using the following code:

import com.google.protobuf.{ByteString, UnknownFieldSet}

// ....
// Builds the raw message: field 1 carries two strings, field 2 carries three.
// Each addLengthDelimited call emits one length-delimited (wire type 2) value.
def doEncoding(): UnknownFieldSet = {
    UnknownFieldSet.newBuilder()
        .addField(1, UnknownFieldSet.Field.newBuilder()
          .addLengthDelimited(ByteString.copyFromUtf8("stuff"))
          .addLengthDelimited(ByteString.copyFromUtf8("stuff"))
          .build())
        .addField(2, UnknownFieldSet.Field.newBuilder()
          .addLengthDelimited(ByteString.copyFromUtf8("stuff"))
          .addLengthDelimited(ByteString.copyFromUtf8("android.microphone"))
          .addLengthDelimited(ByteString.copyFromUtf8("stuff"))
          .build())
        .build()
}

However, when I dump the resulting bytes to a file (calling .toByteArray on the UnknownFieldSet) and read them back with protod, I get an unexpected data structure:

[0a] 1 string: (5) stuff (73 74 75 66 66)
[0a] 1 string: (5) stuff (73 74 75 66 66)
[12] 2 string: (5) stuff (73 74 75 66 66)
[12] 2 string: (18) android.microphone

    [61] 12 fixed64/double: 7867336003066946670 (0x6d2e64696f72646e) (8.381649661287266e+217)
    [69] 13 fixed64/double: 7308901739622527587 (0x656e6f68706f7263) (3.9466026192472086e+180)
[12] 2 string: (5) stuff (73 74 75 66 66)

The first array is fine, but the second is broken and contains values I never entered.

What am I doing wrong when adding the string to the raw protobuf?


Solution

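  • This happens because an encoded Protobuf message is ambiguous on its own, and Protobuf relies on a schema (a .proto file) to disambiguate. On the wire, strings, bytes, and embedded messages all share the same length-delimited encoding, so a decoder without a schema has to guess which one it is looking at. Corollary: multiple Protobuf schemas may produce the same encoded message. For example, given this schema: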

    message X {
      string s = 1;
    }
    

    Using your preferred Protobuf SDK, the following message:

    X{
      S: "android.microphone",
    }
    

    Marshals to (hex-encoded):

    0a12616e64726f69642e6d6963726f70686f6e65
    

    And using protoc to decode the message without a schema:

    printf "0a12616e64726f69642e6d6963726f70686f6e65" \
    | xxd -r -p \
    | protoc --decode_raw
    
    1 {
      12: 0x6d2e64696f72646e
      13: 0x656e6f68706f7263
    }
    

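    These values match the fixed64/double values in your output. The 18 bytes of "android.microphone" happen to form a valid message: the first byte, a (0x61), reads as the tag for field 12 with wire type 1 (fixed64) and consumes the next 8 bytes (ndroid.m); then i (0x69) reads as the tag for field 13, also fixed64, and consumes the final 8 bytes (crophone). The bytes of "stuff" do not parse as a complete message, so the decoder falls back to showing a string.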

    Decoding with protoc and the schema recovers the string correctly:

    protoc --decode=X x.proto
    
    s: "android.microphone"
    

    You can also corroborate this with Protobuf Decoder, using the hex-encoded output above.

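    This is unavoidable without a schema. Your encoding is not wrong: the bytes are a valid length-delimited field, and a reader that knows field 2 holds strings (for example, one that calls toStringUtf8 on the values returned by getLengthDelimitedList) gets "android.microphone" back unchanged.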
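To make the ambiguity concrete, here is a minimal sketch in plain Scala (no Protobuf library) of the kind of guess a schema-less decoder might make: it tries to read a string's UTF-8 bytes as a sequence of Protobuf fields. The names `SchemalessGuess`, `RawField`, and `parseAsMessage` are hypothetical and exist only for this illustration. The sketch assumes that groups (wire types 3 and 4) are simply rejected, and it is not how protod or protoc is actually implemented.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.annotation.tailrec

object SchemalessGuess {
  /** A field as seen without a schema: field number, wire type, payload rendered as text. */
  final case class RawField(number: Int, wireType: Int, payload: String)

  /** Reads a base-128 varint starting at `pos`; returns (value, next position). */
  private def readVarint(b: Array[Byte], pos: Int): Option[(Long, Int)] = {
    @tailrec def loop(i: Int, shift: Int, acc: Long): Option[(Long, Int)] =
      if (i >= b.length || shift >= 64) None
      else {
        val byte = b(i) & 0xff
        val next = acc | ((byte & 0x7f).toLong << shift)
        if ((byte & 0x80) == 0) Some((next, i + 1)) else loop(i + 1, shift + 7, next)
      }
    loop(pos, 0, 0L)
  }

  /** Reads `n` bytes at `pos` as a little-endian integer, if that many bytes remain. */
  private def readLittleEndian(b: Array[Byte], pos: Int, n: Int): Option[Long] =
    if (pos + n > b.length) None
    else Some((0 until n).foldLeft(0L)((acc, k) => acc | ((b(pos + k) & 0xffL) << (8 * k))))

  /** Reads one tag plus its payload; None means the bytes are not a valid field here. */
  private def readField(b: Array[Byte], pos: Int): Option[(RawField, Int)] =
    readVarint(b, pos).flatMap { case (tag, p) =>
      val number = (tag >>> 3).toInt // upper bits of the tag: field number
      (tag & 7).toInt match {        // lower three bits of the tag: wire type
        case _ if number == 0 => None
        case 0 => readVarint(b, p).map { case (v, next) => (RawField(number, 0, v.toString), next) }
        case 1 => readLittleEndian(b, p, 8).map(v => (RawField(number, 1, f"0x$v%016x"), p + 8))
        case 5 => readLittleEndian(b, p, 4).map(v => (RawField(number, 5, f"0x$v%08x"), p + 4))
        case 2 => readVarint(b, p).collect {
          case (len, start) if len >= 0 && len <= b.length - start =>
            (RawField(number, 2, new String(b, start, len.toInt, UTF_8)), start + len.toInt)
        }
        case _ => None // assumption of this sketch: groups and unknown wire types are rejected
      }
    }

  /** Some(fields) if the whole array parses as a message, None otherwise. */
  def parseAsMessage(b: Array[Byte]): Option[List[RawField]] = {
    @tailrec def loop(pos: Int, acc: List[RawField]): Option[List[RawField]] =
      if (pos == b.length) Some(acc.reverse)
      else readField(b, pos) match {
        case Some((field, next)) => loop(next, field :: acc)
        case None                => None
      }
    if (b.isEmpty) None else loop(0, Nil)
  }

  def main(args: Array[String]): Unit = {
    // Parses cleanly as two fixed64 fields, so a decoder may show a nested message:
    println(parseAsMessage("android.microphone".getBytes(UTF_8)))
    // Some(List(RawField(12,1,0x6d2e64696f72646e), RawField(13,1,0x656e6f68706f7263)))

    // Does not parse as a message, so a decoder shows it as a string:
    println(parseAsMessage("stuff".getBytes(UTF_8)))
    // None
  }
}
```

The two hex payloads are the same values that protod and `protoc --decode_raw` report for fields 12 and 13, which suggests the decoder is reinterpreting the string's bytes rather than reading corrupted data.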