I am trying to convert codepoints from one charset to another in Java.
For example, the character ř is 248 in windows-1250 and 345 in Unicode.
So I have a source charset and a source codepoint, plus a target charset, and I want to calculate the target codepoint.
This may sound easy because windows-1250 is a single-byte charset, but I want it to work with any charset, such as GB2312.
I guess it can be done somehow with the Charset class, but it seems that it only converts bytes, not actual code points.
Charset sourceCharset = Charset.forName("GB2312");
int sourceCodePoint = 45257; // 吧 (Chinese character)
Charset targetCharset = Charset.forName("UTF-8");
int targetCodePoint = ...; //???
I checked the Charset class for codepoint-related methods, but there are only decode and encode, which work with bytes. I tried googling for something related, but without success.
Thanks in advance for any help.
At least in Java, there is no notion of codepoints for character sets other than Unicode. You have to convert the integer to a byte array and then decode those bytes to Unicode.
Charset sourceCharset = Charset.forName("windows-1250");
int sourceCodePoint = 248; // ř
byte[] bytes = {(byte)sourceCodePoint};
String targetString = new String(bytes, sourceCharset);
int targetCodePoint = targetString.codePointAt(0);
System.out.println("targetString = " + targetString);
System.out.println("targetCodePoint = " + targetCodePoint);
output:
targetString = ř
targetCodePoint = 345
Chinese characters in GB2312 are represented by 2 bytes, so you need to store them in a byte array of length 2.
Charset sourceCharset = Charset.forName("GB2312");
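int sourceCodePoint = 45257; // 吧 (Chinese character), 0xB0C9
// Needs import java.nio.ByteBuffer; putShort writes the two bytes in big-endian order by default.
byte[] bytes = ByteBuffer.allocate(2).putShort((short)sourceCodePoint).array();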
String targetString = new String(bytes, sourceCharset);
int targetCodePoint = targetString.codePointAt(0);
System.out.println("targetString = " + targetString);
System.out.println("targetCodePoint = " + targetCodePoint);
output:
targetString = 吧
targetCodePoint = 21543
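The two snippets above hard-code the byte length (1 for windows-1250, 2 for GB2312). A minimal sketch of a more general version could derive the byte array from the integer itself and also encode the resulting Unicode code point into a target charset. The class name CodeConverter and the methods toBytes, toUnicode and fromUnicode are hypothetical names made up for this illustration, not part of the JDK. It assumes the source value is the character's encoded bytes read as a big-endian number with no leading zero byte, and that it fits into an int (at most 4 bytes).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not a JDK class.
final class CodeConverter {

    // Splits the value into big-endian bytes, using as few bytes as possible.
    // Assumption: the encoded form has no leading zero byte.
    static byte[] toBytes(int code) {
        int length = 1;
        if ((code >>> 8) != 0) length = 2;
        if ((code >>> 16) != 0) length = 3;
        if ((code >>> 24) != 0) length = 4;
        byte[] bytes = new byte[length];
        for (int i = length - 1; i >= 0; i--) {
            bytes[i] = (byte) code;   // lowest 8 bits go to the last free slot
            code >>>= 8;
        }
        return bytes;
    }

    // Decodes the source charset's code value into a Unicode code point.
    static int toUnicode(int sourceCode, Charset sourceCharset) {
        try {
            String decoded = sourceCharset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(toBytes(sourceCode)))
                    .toString();
            if (decoded.codePointCount(0, decoded.length()) != 1) {
                throw new IllegalArgumentException("Not exactly one character: " + sourceCode);
            }
            return decoded.codePointAt(0);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Invalid code for " + sourceCharset, e);
        }
    }

    // Encodes a Unicode code point and packs the resulting bytes into an int.
    static int fromUnicode(int codePoint, Charset targetCharset) {
        try {
            ByteBuffer encoded = targetCharset.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(new String(Character.toChars(codePoint))));
            int code = 0;
            while (encoded.hasRemaining()) {
                code = (code << 8) | (encoded.get() & 0xFF);   // append next byte
            }
            return code;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Not representable in " + targetCharset, e);
        }
    }
}
```

With the values from the question, CodeConverter.toUnicode(45257, Charset.forName("GB2312")) should give 21543, matching the output above, and CodeConverter.fromUnicode(21543, Charset.forName("UTF-8")) would then give the UTF-8 bytes of 吧 packed into one int. Note that this last number is just the encoded byte sequence, not a code point in any standard sense, which is the point made at the start of this answer.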