I created a CollationKey for the String "a" and then called toByteArray() to convert the CollationKey to a sequence of bits. I then printed the resulting byte[] with Arrays.toString() and got output I don't understand. I expected to get the String represented in bits. How should I interpret the output? Thank you.
package myPackage9;

import java.text.CollationKey;
import java.text.*;
import java.lang.*;
import java.util.Arrays;

public class collatorClass {
    public static void main(String[] args) {
        Collator myCollator = Collator.getInstance();
        CollationKey[] a = new CollationKey[1];
        a[0] = myCollator.getCollationKey("a");
        byte[] bytes = a[0].toByteArray();
        System.out.println(Arrays.toString(bytes));
    }
}
output: [0, 83, 0, 0, 0, 1, 0, 0, 0, 1]
CollationKey is an abstract class. Most likely your concrete type is a RuleBasedCollationKey. First, let's look at the JavaDoc of the method:
Converts the CollationKey to a sequence of bits. If two CollationKeys could be legitimately compared, then one could compare the byte arrays for each of those keys to obtain the same result. Byte arrays are organized most significant byte first.
Apparently, the collation key of "a" is not represented by the same bytes as the string "a", which isn't all that surprising.
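The JavaDoc's claim about comparing byte arrays can be checked with a small sketch. The class name KeyBytesComparison and the helper compareUnsigned are made up for this illustration, and Locale.US is an assumption chosen only to make the run reproducible. The helper compares bytes as unsigned values, most significant byte first, which is how the JavaDoc describes the array layout.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class KeyBytesComparison {

    // Lexicographic comparison of two byte arrays, treating each byte as
    // unsigned and the first byte as the most significant one.
    static int compareUnsigned(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xff) - (y[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        // Locale.US is an assumption; the question uses the default locale.
        Collator collator = Collator.getInstance(Locale.US);
        CollationKey a = collator.getCollationKey("a");
        CollationKey b = collator.getCollationKey("B");

        int byKey = Integer.signum(a.compareTo(b));
        int byBytes = Integer.signum(compareUnsigned(a.toByteArray(), b.toByteArray()));

        // Both signs should agree if the byte arrays preserve the key order.
        System.out.println("compareTo: " + byKey + ", bytes: " + byBytes);
    }
}
```

The point of the byte array is therefore ordering, not a readable encoding of the original text.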
The next step is to look at its source to understand what it is returning exactly:
public byte[] toByteArray() {
    char[] src = key.toCharArray();
    byte[] dest = new byte[ 2*src.length ];
    int j = 0;
    for( int i=0; i<src.length; i++ ) {
        dest[j++] = (byte)(src[i] >>> 8);
        dest[j++] = (byte)(src[i] & 0x00ff);
    }
    return dest;
}
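In other words, every char of the internal key string is written as two bytes, high byte first. A minimal sketch that undoes this packing for the output from the question (the class and method names are made up for this illustration):

```java
import java.util.Arrays;

public class KeyBytesToChars {

    // Reverses the packing done in toByteArray(): each pair of bytes
    // (high byte, low byte) is recombined into one 16-bit value.
    static int[] toCharValues(byte[] bytes) {
        int[] values = new int[bytes.length / 2];
        for (int i = 0; i < values.length; i++) {
            values[i] = ((bytes[2 * i] & 0xff) << 8) | (bytes[2 * i + 1] & 0xff);
        }
        return values;
    }

    public static void main(String[] args) {
        // The byte array printed in the question.
        byte[] bytes = {0, 83, 0, 0, 0, 1, 0, 0, 0, 1};
        System.out.println(Arrays.toString(toCharValues(bytes))); // prints [83, 0, 1, 0, 1]
    }
}
```

So the ten bytes in the question correspond to a key string of five chars.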
What is key? It is passed in as the second constructor parameter. The constructor is called in RuleBasedCollator#getCollationKey. The source is quite complicated, but the method's JavaDoc states:
Transforms the string into a series of characters that can be compared with CollationKey.compareTo. This overrides java.text.Collator.getCollationKey. It can be overriden in a subclass.
Looking at the inline code comments of the method, it is explained further:
// The basic algorithm here is to find all of the collation elements for each
// character in the source string, convert them to a char representation,
// and put them into the collation key. But it's trickier than that.
// Each collation element in a string has three components: primary (A vs B),
// secondary (A vs A-acute), and tertiary (A' vs a); and a primary difference
// at the end of a string takes precedence over a secondary or tertiary
// difference earlier in the string.
//
// To account for this, we put all of the primary orders at the beginning of the
// string, followed by the secondary and tertiary orders, separated by nulls.
Followed by a hypothetical example:
// Here's a hypothetical example, with the collation element represented as
// a three-digit number, one digit for primary, one for secondary, etc.
//
// String:              A     a     B   \u00e9 <--(e-acute)
// Collation Elements: 101   100   201  510
//
// Collation Key:      1125<null>0001<null>1010
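The hypothetical example can be reproduced with a few lines of code. This is only a sketch of the idea in the comment, not the JDK implementation: real collation elements are not three decimal digits, and the name buildKey is invented here.

```java
public class HypotheticalKey {

    // Builds a key from three-digit collation elements, where the hundreds
    // digit is the primary order, the tens digit the secondary order and the
    // ones digit the tertiary order, as in the hypothetical example.
    static String buildKey(int... elements) {
        StringBuilder primary = new StringBuilder();
        StringBuilder secondary = new StringBuilder();
        StringBuilder tertiary = new StringBuilder();
        for (int e : elements) {
            primary.append(e / 100);
            secondary.append((e / 10) % 10);
            tertiary.append(e % 10);
        }
        // All primaries first, then secondaries, then tertiaries.
        return primary + "<null>" + secondary + "<null>" + tertiary;
    }

    public static void main(String[] args) {
        // Elements for "A a B e-acute", taken from the quoted comment.
        System.out.println(buildKey(101, 100, 201, 510)); // prints 1125<null>0001<null>1010
    }
}
```

Because all primary orders come first, a plain left-to-right comparison of two such keys looks at every primary difference before any secondary or tertiary one.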
So the assumption that a CollationKey's toByteArray() method would return the same bytes as a String's getBytes() method is simply wrong. "a".getBytes() is not the same as Collator.getInstance().getCollationKey("a").toByteArray(). If it were, we wouldn't really need collation keys, would we?
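To interpret the output from the question, the bytes can be regrouped into 16-bit values and split at the null separators. The sketch below assumes the key follows the layout described in the quoted source comments (primary orders, null, secondary orders, null, tertiary orders); the class and method names are made up, and the concrete numbers may differ between locales and JDK versions.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeyLevels {

    // Recombines byte pairs into 16-bit values and starts a new level
    // whenever a value of 0 (the assumed separator) is encountered.
    static List<List<Integer>> splitLevels(byte[] bytes) {
        List<List<Integer>> levels = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            int value = ((bytes[i] & 0xff) << 8) | (bytes[i + 1] & 0xff);
            if (value == 0) {
                levels.add(current);
                current = new ArrayList<>();
            } else {
                current.add(value);
            }
        }
        levels.add(current);
        return levels;
    }

    public static void main(String[] args) {
        // The plain string bytes, for contrast.
        System.out.println(Arrays.toString("a".getBytes(StandardCharsets.UTF_8))); // prints [97]

        // The collation key bytes printed in the question.
        byte[] keyBytes = {0, 83, 0, 0, 0, 1, 0, 0, 0, 1};
        System.out.println(splitLevels(keyBytes)); // prints [[83], [1], [1]]
    }
}
```

Read this way, [0, 83, 0, 0, 0, 1, 0, 0, 0, 1] would be a primary order of 83 for "a", followed by one secondary value and one tertiary value of 1 each.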