I am a bit confused about bencoding.
According to the specification, when I bencode a string I need to use the following format:
length:string
The string spam becomes 4:spam
My question: is 4 the number of characters in the string, or the number of UTF-8 bytes?
For instance, suppose I am going to bencode the string gâteau
What number should be specified as the length of this string?
I think I have to specify 7, and the final form should be 7:gâteau
That is because the character â takes 2 bytes in UTF-8, while each of the other characters in this string takes 1 byte.
I have also heard that it is not recommended to store bencoded data in a Java String instance.
In other words, when I bencode a block of data, I should keep it as a byte array and not convert it to a Java String, to avoid encoding issues.
Are my assumptions correct?
According to the specification, a bencoded string is a sequence of bytes, and the length prefix is the number of bytes in that sequence, not the number of characters.
Also, from the specification: "All character string values are UTF-8 encoded".
So for "gâteau" you should specify 7 as the length, because the character â takes 2 bytes in UTF-8 and each of the other five characters takes 1 byte.
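To make the byte counting concrete, here is a minimal Java sketch. The class name BencodeStringSketch and the helper bencodeString are made up for illustration and do not come from any library. It takes the length from the UTF-8 bytes rather than from String.length(), and it returns a byte array. Bencoded data can also carry raw binary values that are not valid text, so keeping the result as bytes is probably the safer choice.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class BencodeStringSketch {

    // Hypothetical helper (not a library API): bencodes a text value as
    // <length in bytes>:<UTF-8 bytes> and returns the result as raw bytes.
    static byte[] bencodeString(String text) {
        byte[] payload = text.getBytes(StandardCharsets.UTF_8);
        // The prefix is the decimal byte count followed by a colon, in ASCII.
        byte[] prefix = (payload.length + ":").getBytes(StandardCharsets.US_ASCII);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(prefix, 0, prefix.length);
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "gateau" with a-circumflex (U+00E2), escaped so that the
        // source-file encoding does not matter.
        String text = "g\u00e2teau";
        byte[] encoded = bencodeString(text);

        System.out.println(text.length());   // 6: characters (UTF-16 code units)
        System.out.println(encoded.length);  // 9: "7:" plus 7 payload bytes
    }
}
```

Using text.length() for the prefix would give 6:gâteau, which a decoder would read as six bytes and then fail on the leftover byte.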