Search code examples
javaunicodecodepoint

How to establish the codepoint of encoded characters?


Given a stream of bytes (that represent characters) and the encoding of the stream, how would I obtain the code points of the characters?

InputStreamReader r = new InputStreamReader(bla, Charset.forName("UTF-8"));
int whatIsThis = r.read(); 

What is returned by read() in the above snippet? Is it the unicode codepoint?


Solution

  • Reader.read() returns a value that can be cast to char or -1 if no more data is available.

    A char is (implicitly) a 16-bit code unit in the UTF-16BE encoding. This encoding can represent basic multilingual plane characters with a single char. The supplementary range is represented using two-char sequences.

    The Character type contains methods for translating UTF-16 code units to Unicode code points:

    A code point that requires two chars will satisfy the isHighSurrogate and isLowSurrogate when you pass in two sequential values from a sequence. The codePointAt methods can be used to extract code points from code unit sequences. There are similar methods for working from code points to UTF-16 code units.


    A sample implementation of a code point stream reader:

    import java.io.*;
    public class CodePointReader implements Closeable {
      private final Reader charSource;
      private int codeUnit;
    
      public CodePointReader(Reader charSource) throws IOException {
        this.charSource = charSource;
        codeUnit = charSource.read();
      }
    
      public boolean hasNext() { return codeUnit != -1; }
    
      public int nextCodePoint() throws IOException {
        try {
          char high = (char) codeUnit;
          if (Character.isHighSurrogate(high)) {
            int next = charSource.read();
            if (next == -1) { throw new IOException("malformed character"); }
            char low = (char) next;
            if(!Character.isLowSurrogate(low)) {
              throw new IOException("malformed sequence");
            }
            return Character.toCodePoint(high, low);
          } else {
            return codeUnit;
          }
        } finally {
          codeUnit = charSource.read();
        }
      }
    
      public void close() throws IOException { charSource.close(); }
    }