Search code examples
javaintbyteprimitive

Variable length-encoding of int to 2 bytes


I'm implementing variable lenght encoding and reading wikipedia about it. Here is what I found:

0x00000080  0x81 0x00

It mean 0x80 int is encoded as 0x81 0x00 2 bytes. That what I cannot understand. Okay, following the algorithm listed there we have.

  1. Binary 0x80: 00000000 00000000 00000000 10000000
  2. We move the sign bit to the next octet so we have and set to 1 (indicating that we have more octets): 00000000 00000000 00000001 10000000 which is not equals to 0x81 0x00. I tried to write a program for that:

    byte[] ba = new byte[]{(byte) 0x81, (byte) 0x00};
    int first = (ba[0] & 0xFF) & 0x7F;
    int second = ((ba[1] & 0xFF) & 0x7F) << 7;
    int result = first | second;
    System.out.println(result); //prints 1, not 0x80
    

ideone

What did I miss?


Solution

  • Let's review the algorithm from the Wikipedia page:

    1. Take the binary representation of the integer
    2. Split it into groups of 7 bits, the group with the highest value will have less
    3. Take these seven bits as a byte, setting the MSB (most significant bit) to 1 for all but the last; leave it 0 for the last one

    We can implement the algorithm like this:

    public static byte[] variableLengthInteger(int input) {
        // first find out how many bytes we need to represent the integer
        int numBytes = ((32 - Integer.numberOfLeadingZeros(input)) + 6) / 7;
        // if the integer is 0, we still need 1 byte
        numBytes = numBytes > 0 ? numBytes : 1;
        byte[] output = new byte[numBytes];
        // for each byte of output ...
        for(int i = 0; i < numBytes; i++) {
            // ... take the least significant 7 bits of input and set the MSB to 1 ...
            output[i] = (byte) ((input & 0b1111111) | 0b10000000);
            // ... shift the input right by 7 places, discarding the 7 bits we just used
            input >>= 7;
        }
        // finally reset the MSB on the last byte
        output[0] &= 0b01111111; 
        return output;
    }
    

    You can see it working for the examples from the Wikipedia page here, you can also plug in your own values and try it online.