
Kotlin: split a UTF string into single-length substrings using code points


I'm just starting Kotlin, so I'm sure there is an easy way to do this, but I don't see it. I want to split a string into single-length substrings using code points. In Java 8, this works:

public class UtfSplit {
    static String [] utf8Split (String str) {
        int [] codepoints = str.codePoints().toArray();
        String [] rv = new String[codepoints.length];
        for (int i = 0; i < codepoints.length; i++)
            rv[i] = new String(codepoints, i, 1);
        return rv;
    }
    public static void main(String [] args) {
        String test = "こんにちは皆さん";
        System.out.println("Test string:" + test);
        StringBuilder sb = new StringBuilder("Result:");
        for(String s : utf8Split(test))
            sb.append(s).append(", ");
        System.out.println(sb.toString());
    }
}

Output is:

Test string:こんにちは皆さん
Result:こ, ん, に, ち, は, 皆, さ, ん, 

How would I do this in Kotlin? I can get to the code points, although it's clumsy and I'm sure I'm doing it wrong, but I can't get from the code points back to strings. The whole string/character interface seems different to me and I'm just not getting it.

Thanks Steve S.


Solution

  • Kotlin on the JVM uses the same runtime and the same String class as Java, so the code does essentially the same thing. The Kotlin version is shorter and does not need a class, although you could group utility methods into an object. Here is the version using top-level functions:

    fun splitByCodePoint(str: String): Array<String> {
        val codepoints = str.codePoints().toArray()
        return Array(codepoints.size) { index ->
            String(codepoints, index, 1)
        }
    }
    
    fun main(args: Array<String>) {
        val input = "こんにちは皆さん"
        val result = splitByCodePoint(input)
    
        println("Test string: ${input}")
        println("Result:      ${result.joinToString(", ")}")
    }
    

    Output:

    Test string: こんにちは皆さん
    Result:      こ, ん, に, ち, は, 皆, さ, ん

    Note: I renamed the function because the encoding does not matter here: you are splitting by code points, not by UTF-8 bytes.
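
To illustrate why splitting by code point differs from splitting by Char, here is a minimal sketch that is not part of the original answer. It assumes a JVM target with Java 8 or later, and the test string (an emoji between two ASCII letters) is a hypothetical input chosen because the emoji occupies two UTF-16 Chars but only one code point.

```kotlin
// Same function as in the answer above, repeated so this sketch is self-contained.
fun splitByCodePoint(str: String): Array<String> {
    val codepoints = str.codePoints().toArray()
    return Array(codepoints.size) { index -> String(codepoints, index, 1) }
}

fun main() {
    // Hypothetical input: "a", then U+1F600 written as a surrogate pair, then "b".
    val input = "a\uD83D\uDE00b"

    // Number of UTF-16 code units (Kotlin Chars).
    println(input.length)                           // 4
    // Number of Unicode code points.
    println(input.codePointCount(0, input.length))  // 3
    // A Char-based split cuts the surrogate pair in half.
    println(input.map { it.toString() }.size)       // 4
    // The code-point split keeps the emoji as one substring.
    println(splitByCodePoint(input).size)           // 3
}
```

For strings made only of characters in the Basic Multilingual Plane, such as the Japanese example in the question, both approaches give the same result.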

    Some might write this without the local variable:

    fun splitByCodePoint(str: String): Array<String> {
        return str.codePoints().toArray().let { codepoints ->
            Array(codepoints.size) { index -> String(codepoints, index, 1) }
        }
    }
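
The question also asks how to get from a single code point back to a string. One possible sketch, again assuming a JVM target, goes through `Character.toChars` and returns a `List<String>` instead of an array. The names `codePointToString` and `splitByCodePointToList` are hypothetical and do not appear in the original answer.

```kotlin
// Hypothetical helper: converts one Unicode code point to a String.
// Character.toChars returns one Char, or a surrogate pair for code points above U+FFFF.
fun codePointToString(codePoint: Int): String = String(Character.toChars(codePoint))

// Hypothetical variant of splitByCodePoint that returns a List instead of an Array.
fun splitByCodePointToList(str: String): List<String> =
    str.codePoints().toArray().map { codePointToString(it) }

fun main() {
    val parts = splitByCodePointToList("こんにちは皆さん")
    println(parts.joinToString(", "))  // こ, ん, に, ち, は, 皆, さ, ん
}
```

Whether an array or a list is the better return type depends on the caller. The list version reads a little more naturally with the rest of the Kotlin collection API.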
    
