rust encoding protocol-buffers

How to properly encode a VarInt?


I have a YAML file with test cases for encoding and decoding elements, which are guaranteed to be correct. The left-hand side represents the expected encoded bytes, and the right-hand side contains the original number. For VarInts, the test cases are:

examples:
  "\0": 0
  "\u0001": 1
  "\u000A": 10
  "\u00c8\u0001": 200
  "\u00e8\u0007": 1000
  "\u00a9\u0046": 9001
  "\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u00ff\u0001": -1

The first three examples work correctly when interpreted as unsigned numbers. However, the fourth example (200) and the subsequent ones don't yield the correct results.

Specifically for 200, I have the following minimal reproducible example:

use bytes::BufMut;
use integer_encoding::VarIntWriter;

fn main() {
    let value = "\u{00c8}\u{0001}";
    // "È\u{1}"
    println!("Expected encoded number as a string: {:?}", value);
    // as_bytes() returns the UTF-8 bytes of the string, not one byte per char
    let buf: &[u8] = value.as_bytes();
    // [195, 136, 1]
    println!("Expected encoded number as a byte array: {:?}", buf);

    let num_as_i32: i32 = 200;
    let mut wr = vec![].writer();
    wr.write_varint(num_as_i32).expect("writing to a Vec should not fail");
    let encoded_result_as_i32: Vec<u8> = wr.into_inner();
    // [144, 3]
    println!("Encoded result as i32: {:?}", encoded_result_as_i32);

    let num_as_u32: u32 = 200;
    let mut wr2 = vec![].writer();
    wr2.write_varint(num_as_u32).expect("writing to a Vec should not fail");
    let encoded_result_as_u32: Vec<u8> = wr2.into_inner();
    // [200, 1]
    println!("Encoded result as u32: {:?}", encoded_result_as_u32);
}

The result [200, 1] seems to make sense as it matches the hex values for "\u00c8\u0001", but it doesn't match the supposedly expected value of [195, 136, 1].

The last example (-1) should be encoded as 1 according to the protobuf VarInt reference, so I seem to be missing something there as well.

Is there something wrong with the string interpretation of the expected encoded values? Or is something missing in the encoding process?


Solution

  • The issue here is that "\u00c8\u0001" needs to be read as the byte array [200, 1] rather than as a UTF-8 string. Encoding the characters U+00C8 and U+0001 as UTF-8 is what produces [195, 136, 1].

    The encoding itself is correct. The fix is either to read the expected bytes directly, without letting them pass through a UTF-8 string, or to let them be parsed as a string and then map each character back to its byte value, which only works if every character is U+00FF or below.

    This would be more adequate as a separate question, so I'm closing this one. Kudos to @cafce25 for the help!
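For anyone landing here, the second option (parse as a string, then map each character back to one byte) could look roughly like the sketch below. It uses only the standard library. The helper name `escaped_str_to_bytes` is made up for illustration, and it assumes every escape in the YAML key is meant as a single byte in the range 0x00 to 0xFF.

```rust
/// Hypothetical helper: treat each char's code point as one raw byte.
/// Returns None if any code point is above U+00FF and so cannot be a byte.
fn escaped_str_to_bytes(s: &str) -> Option<Vec<u8>> {
    s.chars()
        .map(|c| {
            let code_point = c as u32;
            if code_point <= 0xFF {
                Some(code_point as u8)
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    let value = "\u{00c8}\u{0001}";

    // UTF-8 view of the string: U+00C8 occupies two bytes.
    println!("{:?}", value.as_bytes()); // [195, 136, 1]

    // One byte per char, which is what the test file intends.
    println!("{:?}", escaped_str_to_bytes(value)); // Some([200, 1])

    // A code point above U+00FF cannot be mapped back to a byte.
    println!("{:?}", escaped_str_to_bytes("\u{0100}")); // None
}
```

Returning an `Option` rather than truncating keeps a malformed test key from silently turning into the wrong bytes.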

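On the -1 case from the question: as far as I can tell, the ten-byte expected value is what a plain base-128 varint gives when -1 is reinterpreted as an unsigned 64-bit two's-complement number (protobuf's int32/int64 behaviour). The single byte 1 is what ZigZag encoding (protobuf's sint32/sint64) gives. The [144, 3] result for the i32 in the question is consistent with ZigZag being applied to signed values. The sketch below is a hand-rolled encoder for illustration only, not the `integer_encoding` implementation. `encode_varint` and `zigzag` are names I made up.

```rust
/// Minimal base-128 varint encoder for an unsigned 64-bit value:
/// 7 payload bits per byte, high bit set on every byte except the last.
fn encode_varint(mut n: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            return out;
        }
        out.push(byte | 0x80);
    }
}

/// ZigZag mapping for signed values: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
fn zigzag(n: i64) -> u64 {
    ((n << 1) ^ (n >> 63)) as u64
}

fn main() {
    // Unsigned values from the test file.
    println!("{:?}", encode_varint(200)); // [200, 1]
    println!("{:?}", encode_varint(1000)); // [232, 7]
    println!("{:?}", encode_varint(9001)); // [169, 70]

    // -1 reinterpreted as u64: nine 0xff bytes followed by 0x01.
    println!("{:?}", encode_varint(-1i64 as u64));

    // ZigZag first, then varint.
    println!("{:?}", encode_varint(zigzag(-1))); // [1]
    println!("{:?}", encode_varint(zigzag(200))); // [144, 3]
}
```

So the test file appears to expect the non-ZigZag form: casting the signed value to `u64` before encoding reproduces every expected byte sequence listed in the question.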