java unicode utf-16 charsequence surrogate-pairs

How to verify whether an instance of CharSequence is a sequence of Unicode scalar values?


I have an instance of java.lang.CharSequence. I need to determine whether this instance is a sequence of Unicode scalar values (that is, whether the instance is in UTF-16 encoding form). Despite the assurances of java.lang.String, a Java string is not necessarily in UTF-16 encoding form (at least not according to the latest Unicode specification, currently 6.2), since it may contain isolated surrogate code units. (A Java string is, however, a Unicode 16-bit string.)

There are several obvious ways in which to go about this, including:

  1. Iterate over the code points of the sequence, explicitly validating each as a Unicode scalar value.
  2. Use a regular expression to search for isolated surrogate code points.
  3. Pipe the character sequence through a character-set encoder that reports encoding errors.

It seems as though something like this should already exist as a library function, however. I just can't find it in the standard API. Am I missing it, or do I need to implement it?


Solution

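  • Try this function. It returns false as soon as it finds a surrogate code unit that is not part of a well-formed high/low pair: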

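    // Returns true if cs contains no unpaired surrogate code units.
    static boolean isValidUTF16(CharSequence cs) {
        for (int i = 0; i < cs.length(); i++) {
            char c = cs.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be immediately followed by a low surrogate.
                if (i + 1 == cs.length() || !Character.isLowSurrogate(cs.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate that was not consumed as part of a pair is isolated.
                return false;
            }
        }
        return true;
    }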
    

    Here's a test:

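    public static void main(String[] args) {
        System.out.println(isValidUTF16("\uDC00\uDBFF")); // false: low surrogate first, then an unpaired high surrogate
        System.out.println(isValidUTF16("\uDBFF\uDC00")); // true: a well-formed surrogate pair
    }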
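
For comparison, two of the other approaches listed in the question (iterating over code points, and running the sequence through an encoder that reports errors) could be sketched roughly as below. This is only a sketch: the class name Utf16Checks and both method names are made up for illustration, and UTF-8 is an arbitrary choice of target charset, since any Unicode encoder configured to report malformed input should reject unpaired surrogates.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative, not part of any standard API.
final class Utf16Checks {

    // Approach 1: walk the code points and reject any that fall in the surrogate range.
    // Character.codePointAt returns the surrogate itself when it is not part of a pair.
    static boolean viaCodePoints(CharSequence cs) {
        for (int i = 0; i < cs.length(); ) {
            int cp = Character.codePointAt(cs, i);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return false; // isolated surrogate, so not a Unicode scalar value
            }
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return true;
    }

    // Approach 3: encode with an encoder that reports malformed input instead of replacing it.
    static boolean viaEncoder(CharSequence cs) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(cs)); // the encoded bytes are discarded
            return true;
        } catch (CharacterCodingException e) {
            return false; // an unpaired surrogate was reported as malformed input
        }
    }

    public static void main(String[] args) {
        System.out.println(viaCodePoints("\uDC00\uDBFF")); // false
        System.out.println(viaCodePoints("\uDBFF\uDC00")); // true
        System.out.println(viaEncoder("\uDC00\uDBFF"));    // false
        System.out.println(viaEncoder("\uDBFF\uDC00"));    // true
    }
}
```

The code-point loop allocates nothing, while the encoder variant produces a byte buffer that is thrown away, so the loop is probably the better fit when the check runs often.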