Kotlin on Android: bad conversion of UTF-8 characters


I am encountering a basic issue with Kotlin on Android (API 29):

import kotlin.text.Charsets.UTF_8

val buf = byteArrayOf(0xF0.toByte(), 0xa9.toByte(), 0xbd.toByte(), 0xbe.toByte())
val s = String(buf, UTF_8)
Log.e(TAG, "buf len ${buf.size} as UTF-8 : <$s> len ${s.length}")

After converting the byte buffer to a string, I get a string length of 2 (as if Kotlin thinks the string is UTF-16), but of course the string length should be 1. Python has no such problem!

Output:

 buf len 4 as UTF-8 : <𩽾> len 2

I am a bit puzzled: how is that possible? How could Kotlin's string conversion be wrong?

Note that the UTF-8 byte sequence F0 A9 BD BE is correct for the character 𩽾.


Solution

  • This isn't strictly an Android or Kotlin phenomenon; it comes from the underlying Java String implementation. You'd get the same results in an equivalent Java implementation running on a desktop JVM, for example.

    length() for Java Strings is defined as "the number of Unicode code units in the string." (You alluded to Python in your question: Java and Python simply differ in this respect.)

    "Unicode code unit" is, in turn, defined as "16-bit char values that are code units of the UTF-16 encoding".

    Your character, 𩽾, is represented in UTF-16 with two surrogates (D867 + DF7E), so length() reports 2.
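
    To see the two code units for yourself, here is a minimal sketch that dumps them in hex. The helper name utf16Units is made up for illustration, and Char.code assumes Kotlin 1.5 or newer:

    ```kotlin
    // Hypothetical helper: formats each UTF-16 code unit of a string as four hex digits.
    fun utf16Units(s: String): List<String> =
        s.map { "%04X".format(it.code) }

    fun main() {
        val buf = byteArrayOf(0xF0.toByte(), 0xA9.toByte(), 0xBD.toByte(), 0xBE.toByte())
        val s = String(buf, Charsets.UTF_8)

        println(utf16Units(s))                    // [D867, DF7E]
        println(s[0].isHighSurrogate())           // true
        println(s[1].isLowSurrogate())            // true
        println("U+%X".format(s.codePointAt(0)))  // U+29F7E
    }
    ```

    The decoding itself is fine: the four UTF-8 bytes become a single code point, which the String then stores as two 16-bit chars.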

    What you might ultimately be interested in is the number of code points, which more closely correlates to the number of characters that are rendered. String has methods like codePoints() and codePointCount() that could help here. (I'm not sure off the top of my head if you could use the streaming API that codePoints() has in your Android project.)

    Here's an example where I made an extension on String to avoid having to deal with indices every time:

    val buf = byteArrayOf(0xF0.toByte(), 0xa9.toByte(), 0xbd.toByte(), 0xbe.toByte())
    val s = String(buf, UTF_8)
    
    fun String.totalCodePointCount() = codePointCount(0, length)
    
    s.totalCodePointCount() // returns 1
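
    If the stream-based codePoints() turns out not to be usable in your project, you can walk the code points by hand with codePointAt() and Character.charCount(). This is only a sketch; the helper name codePointsOf is hypothetical:

    ```kotlin
    // Hypothetical helper: collects the code points of a string without the stream API.
    fun codePointsOf(s: String): List<Int> {
        val result = mutableListOf<Int>()
        var i = 0
        while (i < s.length) {
            val cp = s.codePointAt(i)
            result.add(cp)
            // Advance by 1 char for a BMP character, 2 for a surrogate pair.
            i += Character.charCount(cp)
        }
        return result
    }

    fun main() {
        val bytes = byteArrayOf(0xF0.toByte(), 0xA9.toByte(), 0xBD.toByte(), 0xBE.toByte())
        val s = "a" + String(bytes, Charsets.UTF_8)

        println(s.length)                                   // 3
        println(codePointsOf(s).map { "U+%X".format(it) })  // [U+61, U+29F7E]
    }
    ```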
    

    This blog post is a great overview that covers other topics like emoji, which my code point counting suggestion doesn't cover.
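
    To illustrate why counting code points is still not the same as counting what a user sees as one character, here is a rough sketch using java.text.BreakIterator. The helper name graphemeCount is hypothetical, and I use a combining accent rather than an emoji because, as far as I know, how emoji sequences are segmented depends on the Unicode data shipped with your platform:

    ```kotlin
    import java.text.BreakIterator

    // Hypothetical helper: counts user-perceived characters as segmented by BreakIterator.
    fun graphemeCount(s: String): Int {
        val boundaries = BreakIterator.getCharacterInstance()
        boundaries.setText(s)
        var count = 0
        // Each call to next() moves to the following character boundary.
        while (boundaries.next() != BreakIterator.DONE) count++
        return count
    }

    fun main() {
        val s = "e\u0301"  // 'e' followed by U+0301 COMBINING ACUTE ACCENT

        println(s.length)                       // 2
        println(s.codePointCount(0, s.length))  // 2
        println(graphemeCount(s))               // 1
    }
    ```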