kotlin character-encoding

Kotlin: string length is twice the expected length, except for whitespace


My Kotlin program receives a string from a C++ file that I don't want to change: it is part of a machine learning process that I don't fully understand, and the code comes from the model's maker. The returned string looks normal when printed, but its length is wrong. For example, the string "aBc" reports a length of 6. Whitespace is still counted normally, though: the string "a Bc" reports a length of 7 instead of 8. This makes it hard to determine the real length, and it also breaks String.take and String.takeLast in Kotlin, because they return only about half as many visible characters as I ask for. How do I fix this?

Example:

val result = Model.label() // "aBc"

print(result.length.toString()) // 6

print(result.takeLast(5)) // Bc

val result2 = Model2.label() // "a Bc"

print(result2.length.toString()) // 7

How do I get the string back to normal without touching the C++ code (changing only the Kotlin file)? This is what I have tried so far:

// item is the string returned by the C++ model
println(item)
println(item.length.toString())

// Encode the string as UTF-8 bytes, then decode those bytes as UTF-16
val receivedBytes: ByteArray = item.toByteArray(Charset.forName("UTF-8"))
val receivedString = String(receivedBytes, Charset.forName("UTF-16"))
println(receivedString)
println(receivedString.length.toString())

Result:

PEREMPUAN
18
倍䔍刍䔍䴍倍唍䄍不
9

I assume the cause is a difference in string encoding between C++ and Kotlin. For reference, the model was created by Chinese developers, so that may be related.


Solution

  • The string you received has an extra U+000D (aka carriage return, \r) character after each letter. I'm not sure why that is coming from your C++ code.

    Assuming you don't have any "legitimate" carriage returns that you want to keep, you can remove all the carriage returns by doing:

    val result = item.replace("\r", "")
    

    Then the number of characters in result would be as expected.

    If your string includes the Windows CRLF line separator (\r\n), and you don't want to remove the \r in those, you can use a regex like this to remove only a \r that is not followed by a \n:

    val result = item.replace("\r(?!\n)".toRegex(), "")
    

    If there are other legitimate \r that you want to keep, you would need to come up with a way to differentiate between those \r and the \r you don't want.
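
To check this on your own data, here is a minimal sketch that wraps the two replacements above and adds a small helper for making invisible characters visible. The function names (describeChars, stripAllCr, stripStrayCr) are hypothetical and only for illustration, and the sample strings in the comments are simulated stand-ins for whatever Model.label() actually returns. It assumes Kotlin 1.5 or later for Char.code and uppercase().

```kotlin
// Hypothetical helpers for illustration; none of these names come from the
// original code. Requires Kotlin 1.5+ (Char.code, String.uppercase).

/** Lists every character as a code point so invisible ones (such as U+000D) show up. */
fun describeChars(s: String): String =
    s.map { "U+" + it.code.toString(16).uppercase().padStart(4, '0') }
        .joinToString(" ")

/** Removes every carriage return, as in the first replacement above. */
fun stripAllCr(s: String): String = s.replace("\r", "")

/** Removes only a carriage return that is not followed by a line feed, so CRLF pairs survive. */
fun stripStrayCr(s: String): String = s.replace(Regex("\r(?!\n)"), "")
```

For a simulated input such as "a\rB\rc\r", describeChars should list a U+000D after each letter, and stripAllCr should return a string for which length, take and takeLast behave normally again. If describeChars shows something other than U+000D on the real model output, the replacements would need to target that character instead.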