Tags: java, protocol-buffers, proto, proto3

Long values in objects not serialized and deserialized properly when using proto3 in Java


I am trying to serialize and deserialize an object in Java using proto3. Here is what my message definition looks like in proto

syntax = "proto3";

option java_multiple_files = true;
option java_package = "com.project.dataModel";
option java_outer_classname = "FlowProto";


// The request message containing the user's name.
message Flow {
    string subscriberIMSEI = 1;
    string destinationIP = 2;
    uint64 txBytes = 3;
    uint64 rxBytes = 4;
    uint64 txPkts = 5;
    uint64 rxPkts = 6;
    uint64 startTimeInMillis = 7;
    uint64 endTimeInMillis = 8;
    string asnNumber = 9;
    string asnName = 10;
    string asnCountryCode = 11;

}

Here is what my serialization and deserialization code looks like in Java

import com.google.protobuf.ByteString;
import com.project.dataModel.Flow;

public class Test {

    public static void main(String[] args) throws Exception {

        Flow flow = Flow.newBuilder()
                .setAsnName("abc")
                .setEndTimeInMillis(123456789L)
                .setStartTimeInMillis(123456789L)
                .setDestinationIP("1.1.1.1")
                .setTxBytes(1L)
                .setRxBytes(1L)
                .setTxPkts(1L)
                .setRxPkts(1L)
                .setAsnName("blah")
                .setAsnCountryCode("blah")
                .build();

        byte[] flowByteArray = flow.toByteArray();

        String flowString = flow.toByteString().toStringUtf8();

        System.out.println("Parsed from ByteArray:" + Flow.parseFrom(flowByteArray).getEndTimeInMillis());
        System.out.println("Parsed from ByteString:" + Flow.parseFrom(ByteString.copyFromUtf8(flowString))
                .getEndTimeInMillis());
    }
}

My output is as follows

Parsed from ByteArray:123456789
Parsed from ByteString:-4791902657223630865

Where am I going wrong when I try to take the ByteString and UTF-8 route for serialization and deserialization?

Thanks!


Solution

  • The reason you are seeing this issue is that your serialized byte array is being corrupted. Protobuf's wire format is arbitrary binary data, not valid UTF-8 text, and UTF-8 is a variable-length encoding: decoding arbitrary bytes as UTF-8 replaces every invalid byte sequence with the Unicode replacement character. So when you call flow.toByteString().toStringUtf8(), a single byte in the original ByteString may be turned into three new bytes with different values. When you then call ByteString.copyFromUtf8(flowString), that damage is not undone; the call simply re-encodes the already-transformed string, not the original bytes you put in.

    Here is a small test that illustrates the issue you are seeing

    @Test
    public void byteConsistency() {
      // 0xFF (-1) is not a valid byte anywhere in a UTF-8 sequence
      byte[] vals = new byte[] {0, 110, -1};
      ByteString original = ByteString.copyFrom(vals);
      // Round-trip through a UTF-8 String, just like the code in the question
      ByteString newString = ByteString.copyFromUtf8(original.toStringUtf8());

      for (int index = 0; index < newString.size(); index++) {
        System.out.println(newString.byteAt(index));
      }
    }
    

    You would expect this code to output

    0
    110
    -1
    

    But it actually outputs

    0
    110
    -17
    -65
    -67
    

    That's because a -1 (0xFF) byte is not valid anywhere in a UTF-8 sequence, so the decoder replaces it with the Unicode replacement character U+FFFD, which re-encodes to UTF-8 as the three bytes [-17, -65, -67] (0xEF 0xBF 0xBD).
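
    You can reproduce the same replacement behavior with just the JDK, without protobuf. Below is a minimal sketch (the class name ReplacementCharDemo is only for illustration) showing that decoding a 0xFF byte as UTF-8 yields U+FFFD, which re-encodes to exactly those three bytes.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class ReplacementCharDemo {
      public static void main(String[] args) {
        // 0xFF is never valid in UTF-8, so decoding substitutes U+FFFD (code point 65533)
        String decoded = new String(new byte[] {0, 110, -1}, StandardCharsets.UTF_8);
        System.out.println((int) decoded.charAt(2)); // 65533

        // Re-encoding U+FFFD produces the three bytes seen in the test above
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(reencoded)); // [0, 110, -17, -65, -67]
      }
    }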

    In summary, when dealing with protobuf, don't convert serialized messages into UTF-8 strings. Use the raw bytes for serialization and deserialization; if you decode them as a UTF-8 string, the bytes will be corrupted and you will not be able to deserialize the message. If you really do need a text representation, encode the raw bytes with something lossless such as Base64, as in the sketch below.
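
    Here is a rough sketch of both options, assuming the same generated Flow class as in the question (the class name SafeRoundTrip is only for illustration; the Base64 round trip uses the standard java.util.Base64 and is not protobuf-specific).

    import java.util.Base64;

    import com.project.dataModel.Flow;

    public class SafeRoundTrip {
      public static void main(String[] args) throws Exception {
        Flow flow = Flow.newBuilder().setEndTimeInMillis(123456789L).build();

        // Preferred: keep the serialized form as raw bytes
        byte[] raw = flow.toByteArray();
        System.out.println(Flow.parseFrom(raw).getEndTimeInMillis()); // 123456789

        // If a String is unavoidable (logs, JSON, text protocols), Base64-encode the raw bytes
        String encoded = Base64.getEncoder().encodeToString(raw);
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(Flow.parseFrom(decoded).getEndTimeInMillis()); // 123456789
      }
    }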