java kotlin utf-16

Java/Kotlin: How do I iterate through a String so that combining characters stay with their base characters?


I am attempting to iterate the following string:

mɔ̃tr

But no matter what I do, it always ends up getting processed as:

m ɔ ̃ t r

The tilde seems to detach from the reversed c.

One of my first attempts was to do the following:

"mɔ̃tr".map {
    print(it)
}

But the tilde would not stay with the reversed c.

I saw suggestions for the following iterator:

fun codePoints(string: String): Iterable<String> {
    return object : Iterable<String> {
        override fun iterator(): MutableIterator<String> {
            return object : MutableIterator<String> {
                var nextIndex = 0
                override fun hasNext(): Boolean {
                    return nextIndex < string.length
                }

                override fun next(): String {
                    val result = string.codePointAt(nextIndex)
                    nextIndex += Character.charCount(result)
                    return String(Character.toChars(result))
                }

                override fun remove() {
                    throw UnsupportedOperationException()
                }
            }
        }
    }
}

But this gave the same output as the previous example.

I have been stuck on this seemingly simple problem for a day now. I just want to process this string as though it had 4 characters, not 5.

Any tips?


Solution

  • "ɔ̃" consists of two Unicode code points: the base letter ɔ (U+0254) and the combining tilde (U+0303). This is why the code point iterator you showed still splits them apart.

    "ɔ̃" is, however, a single grapheme cluster. To iterate over grapheme clusters, you need a java.text.BreakIterator obtained from BreakIterator.getCharacterInstance(). The documentation has an example that shows how:

    public static void printEachForward(BreakIterator boundary, String source) {
        int start = boundary.first();
        for (int end = boundary.next();
             end != BreakIterator.DONE;
             start = end, end = boundary.next()) {
            System.out.println(source.substring(start, end));
        }
    }
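
The documentation snippet above expects a BreakIterator that has already been created and given its text. A minimal sketch of the full setup might look like the following. The class name GraphemeClusters and the method split are hypothetical names chosen for illustration, and the word is written with \u escapes so the result does not depend on the source file's encoding.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the JDK or of the original answer.
class GraphemeClusters {

    // Splits a string into grapheme clusters (user-perceived characters).
    static List<String> split(String source) {
        // The character instance reports boundaries between grapheme clusters.
        BreakIterator boundary = BreakIterator.getCharacterInstance();
        boundary.setText(source);

        List<String> clusters = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next();
             end != BreakIterator.DONE;
             start = end, end = boundary.next()) {
            clusters.add(source.substring(start, end));
        }
        return clusters;
    }

    public static void main(String[] args) {
        // "mɔ̃tr": m, U+0254 (ɔ), U+0303 (combining tilde), t, r
        String word = "m\u0254\u0303tr";
        List<String> clusters = split(word);
        System.out.println(clusters.size()); // prints 4
    }
}
```

Returning a List keeps the sketch short. For very long strings, a lazy iterator over the same boundaries may be preferable.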
    

    In Kotlin, you can write an extension function on String that returns a Sequence of the grapheme clusters:

    import java.text.BreakIterator

    fun String.graphemeClusterSequence() = sequence {
        // A character-instance BreakIterator reports grapheme cluster boundaries.
        val iterator = BreakIterator.getCharacterInstance()
        iterator.setText(this@graphemeClusterSequence)
        var start = iterator.first()
        var end = iterator.next()
        while (end != BreakIterator.DONE) {
            yield(this@graphemeClusterSequence.substring(start, end))
            start = end
            end = iterator.next()
        }
    }
    

    Now "mɔ̃tr".graphemeClusterSequence().forEach { println(it) } prints:

    m
    ɔ̃
    t
    r
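
To see why iterating by code point did not help in the question, it can be useful to measure the same word at three levels: UTF-16 code units, code points, and grapheme clusters. The sketch below is illustrative only; the class LengthLevels and the method graphemeClusterCount are made-up names, not an existing API.

```java
import java.text.BreakIterator;

// Hypothetical demo class comparing three notions of "length".
class LengthLevels {

    // Counts grapheme clusters by counting the boundaries after the first one.
    static int graphemeClusterCount(String source) {
        BreakIterator boundary = BreakIterator.getCharacterInstance();
        boundary.setText(source); // resets the position to the first boundary
        int count = 0;
        while (boundary.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "mɔ̃tr": m, U+0254 (ɔ), U+0303 (combining tilde), t, r
        String word = "m\u0254\u0303tr";

        // UTF-16 code units (what String.length() and Kotlin's map see)
        System.out.println(word.length()); // prints 5

        // Unicode code points (what the code point iterator sees)
        System.out.println(word.codePointCount(0, word.length())); // prints 5

        // Grapheme clusters (what a reader perceives as characters)
        System.out.println(graphemeClusterCount(word)); // prints 4
    }
}
```

Every code point in this word fits in one UTF-16 code unit, so the first two counts agree. Only the grapheme cluster count matches the 4 characters the question asks for.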