
Read text stream codepoint by codepoint


I'm trying to read Unicode code points from a text file in Java. InputStreamReader.read() returns the stream's contents one int at a time, which I hoped would do what I want, but each int is a single UTF-16 code unit (a char), so surrogate pairs are not combined into one code point.

My test program:

import java.io.*;
import java.nio.charset.*;

class TestChars {
    public static void main(String args[]) {
        InputStreamReader reader =
            new InputStreamReader(System.in, StandardCharsets.UTF_8);
        try {
            System.out.print("> ");
            int code = reader.read();
            while (code != -1) {
                String s =
                    String.format("Code %x is `%s', %s.",
                                  code,
                                  Character.getName(code),
                                  new String(Character.toChars(code)));
                System.out.println(s);
                code = reader.read();
            }
        } catch (IOException e) {
            // Report I/O failures instead of silently swallowing them.
            System.err.println("Error: " + e);
        }
    }
}

This behaves as follows:

$ java TestChars 
> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE',  .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE',  .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE',  .
Code d83c is `HIGH SURROGATES D83C', ?.
Code df55 is `LOW SURROGATES DF55', ?.
Code a is `LINE FEED (LF)', 
.

My problem is that the two surrogates making up the pizza emoji are read separately. I would like to read the symbol into a single int and be done with it.

Question: Is there a reader(-like) class that automatically combines surrogate pairs into code points while reading (and, presumably, throws an exception if the input is malformed)?

I know I could compose the pairs myself, but I would prefer avoiding reinventing the wheel.


Solution

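  • String has a codePoints() method that returns a stream of code points, so if you read the input line by line you don't have to deal with surrogate pairs yourself. Note that BufferedReader.lines() strips line terminators, so the newline characters themselves are not reported: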

    import java.io.*;
    
    class cptest {
        public static void main(String[] args) {
            try (BufferedReader br =
                    new BufferedReader(new InputStreamReader(System.in, "UTF-8"))) {
                br.lines().flatMapToInt(String::codePoints).forEach(cptest::print);
            } catch (Exception e) {
                System.err.println("Error: " + e);
            }
        }
        private static void print(int cp) {
            String s = new String(Character.toChars(cp));
            System.out.println("Character " + cp + ": " + s);
        }
    }
    

    This produces:

    $ java cptest <<< "keyboard ⌨. pizza 🍕"
    Character 107: k
    Character 101: e
    Character 121: y
    Character 98: b
    Character 111: o
    Character 97: a
    Character 114: r
    Character 100: d
    Character 32:  
    Character 9000: ⌨
    Character 46: .
    Character 32:  
    Character 112: p
    Character 105: i
    Character 122: z
    Character 122: z
    Character 97: a
    Character 32:  
    Character 127829: 🍕
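
If you would rather read code point by code point directly from a Reader (keeping line feeds, and failing on an unpaired surrogate as the question asks), the pairing can be sketched with the standard Character helpers. This is only a minimal sketch: the class name CodePointReading and the methods readCodePoint and readAll are hypothetical names made up for this example, and throwing IOException on a lone surrogate is a choice of the sketch, not something the JDK does for you. It does not detect malformed UTF-8 bytes either; as far as I know an InputStreamReader built from a charset replaces those with U+FFFD instead of throwing.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class CodePointReading {
    /**
     * Reads one code point from the reader, or returns -1 at end of input.
     * Hypothetical helper: throws IOException on an unpaired surrogate.
     */
    public static int readCodePoint(Reader reader) throws IOException {
        int first = reader.read();
        if (first == -1) {
            return -1; // end of input
        }
        char c = (char) first;
        if (Character.isLowSurrogate(c)) {
            throw new IOException(
                String.format("Low surrogate %x without a preceding high surrogate", first));
        }
        if (!Character.isHighSurrogate(c)) {
            return first; // ordinary BMP character, already a full code point
        }
        // A high surrogate must be followed by a low surrogate.
        int second = reader.read();
        if (second == -1 || !Character.isLowSurrogate((char) second)) {
            throw new IOException(
                String.format("High surrogate %x without a following low surrogate", first));
        }
        return Character.toCodePoint(c, (char) second);
    }

    /** Reads all remaining code points into a list (convenience for the demo). */
    public static List<Integer> readAll(Reader reader) throws IOException {
        List<Integer> codePoints = new ArrayList<>();
        for (int cp = readCodePoint(reader); cp != -1; cp = readCodePoint(reader)) {
            codePoints.add(cp);
        }
        return codePoints;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the InputStreamReader over System.in.
        for (int cp : readAll(new StringReader("pizza \uD83C\uDF55\n"))) {
            System.out.printf("Code %x is `%s'%n", cp, Character.getName(cp));
        }
    }
}
```

To use it with the question's program, pass the InputStreamReader (ideally wrapped in a BufferedReader) to readCodePoint in place of the StringReader.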