Tags: java, unicode, utf-8, utf-16, codepoint

How to sort strings in Unicode code point (UTF-8 or UTF-32) order in Java?


Java's String.compareTo compares strings in UTF-16 code unit order.

List<String> inputValues = Arrays.asList("𝐴", "ﬁgure", "ﬂagship", "zion");
Collections.sort(inputValues);

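The code above produces the order [zion, 𝐴, ﬁgure, ﬂagship]. However, I want the order to be [zion, ﬁgure, ﬂagship, 𝐴]. Note that some of the characters are ligatures: ﬁgure starts with ﬁ (U+FB01) and ﬂagship starts with ﬂ (U+FB02), while 𝐴 is U+1D434, which lies outside the Basic Multilingual Plane and is stored in UTF-16 as a surrogate pair.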
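To see why the two orders differ, it may help to look at what each comparison actually sees. The sketch below is only an illustration (the class and method names are made up for this example): it prints the first UTF-16 code unit and the first full code point of 𝐴 and of a string starting with the ﬁ ligature.

```java
class CodeUnitVsCodePoint {
    // First UTF-16 code unit: this is what String.compareTo compares.
    static int firstCodeUnit(String s) {
        return s.charAt(0);
    }

    // First full Unicode code point; a surrogate pair is combined into one value.
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        String mathA = "\uD835\uDC34"; // U+1D434 MATHEMATICAL ITALIC CAPITAL A
        String fi = "\uFB01gure";      // starts with U+FB01 LATIN SMALL LIGATURE FI

        // Code units: D835 vs FB01, so UTF-16 order puts the mathematical A first.
        System.out.printf("code units:  %04X vs %04X%n",
                firstCodeUnit(mathA), firstCodeUnit(fi));

        // Code points: 1D434 vs FB01, so code point order puts the ligature first.
        System.out.printf("code points: %X vs %X%n",
                firstCodePoint(mathA), firstCodePoint(fi));
    }
}
```

The high surrogate 0xD835 is smaller than 0xFB01, even though the code point it belongs to (0x1D434) is larger. That is the whole discrepancy between the two sort orders.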


Solution

  • Sorry, I am not looking for lexicographic sorting but simply sorting based on Unicode code point (UTF-8 or UTF-32).

    There is a comment in one of the libraries that I am trying to use:

    Input values (keys). These must be provided to Builder in Unicode code point (UTF8 or UTF32) sorted order. Note that sorting by Java's String.compareTo, which is UTF16 sorted order, is not correct and can lead to exceptions while building the FST

    I was running into issues because I was using Collections.sort, which sorts strings in UTF-16 code unit order in Java. In the end I wrote my own compare function, shown below, which resolves the issues I was facing. I am surprised that this is not available natively or in some other popular library.

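    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    // Sorts the list in Unicode code point order (the same as UTF-8 / UTF-32 order).
    public static void sort(List<String> list) {
        Collections.sort(
                list,
                new Comparator<String>() {
                    @Override
                    public int compare(String s1, String s2) {
                        int n1 = s1.length();
                        int n2 = s2.length();
                        int min = Math.min(n1, n2);
                        int i = 0;
                        while (i < min) {
                            // Read a full code point; a surrogate pair counts as one.
                            int c1 = s1.codePointAt(i);
                            int c2 = s2.codePointAt(i);
                            if (c1 != c2) {
                                return Integer.compare(c1, c2);
                            }
                            // The strings are identical so far, so one step fits both.
                            i += Character.charCount(c1);
                        }
                        // One string is a prefix of the other: the shorter sorts first.
                        return Integer.compare(n1, n2);
                    }
                });
    }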
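Since the library comment mentions UTF-8 order as well, another possible approach is to compare the UTF-8 encodings byte by byte as unsigned values, which for well-formed strings gives the same result as code point order. The following is a minimal sketch under a few assumptions: it needs Java 9 or later for Arrays.compareUnsigned, the names Utf8OrderDemo, UTF8_ORDER and sortedByCodePoint are made up for illustration, and strings with unpaired surrogates are not handled (getBytes replaces them). It also encodes both strings on every comparison, so it trades speed for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class Utf8OrderDemo {
    // Hypothetical helper: orders strings by their UTF-8 bytes, read as unsigned.
    // For well-formed strings this matches Unicode code point order.
    static final Comparator<String> UTF8_ORDER = (a, b) ->
            Arrays.compareUnsigned(
                    a.getBytes(StandardCharsets.UTF_8),
                    b.getBytes(StandardCharsets.UTF_8));

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sortedByCodePoint(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(UTF8_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        // The values from the question, written with escapes:
        // U+1D434 (mathematical A), U+FB01 (fi ligature), U+FB02 (fl ligature).
        List<String> input = Arrays.asList(
                "\uD835\uDC34", "\uFB01gure", "\uFB02agship", "zion");

        // zion comes first, then the two ligature words, and U+1D434 comes last.
        System.out.println(sortedByCodePoint(input));
    }
}
```

For large inputs, the hand-written comparator is likely the better choice because it allocates nothing per comparison.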