After many hours of going through questions and answers on Stack Overflow, I couldn't get this to work. Here's the problem. Consider the following JSON object from Facebook's downloadable JSON data:
{
"sender_name": "megalo\u00e5\u00bd\u00a9",
"timestamp_ms": 1679173611981,
"content": "Reacted \u00f0\u009f\u00a4\u008d to your message "
}
The problem: In the example JSON above, the sender name contains Japanese characters, and the chat message content contains a white heart emoji, represented by the escape sequence \u00f0\u009f\u00a4\u008d (the four UTF-8 bytes of the emoji, each written as its own \u00XX escape). However, when displayed in an Android TextView or in Jetpack Compose, it shows up as ð¤, which is clearly two separate characters. Android failed to interpret the whole 4-part sequence as a single emoji.
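To make the symptom concrete, here is a minimal sketch of what a JSON parser hands back for that escape sequence. The helper name codePointsOf is made up for illustration, and the string literal stands in for the parsed "content" value. Each \u00XX escape becomes its own character; two of the four (U+009F and U+008D) are invisible control characters, which would explain why only ð and ¤ show up on screen.

```kotlin
// Hypothetical helper for illustration: lists each char of a string as U+XXXX.
fun codePointsOf(text: String): List<String> =
    text.map { "U+%04X".format(it.code) }

fun main() {
    // Stand-in for the emoji part of "content" after JSON parsing:
    // one Kotlin Char per \u00XX escape.
    val parsed = "\u00f0\u009f\u00a4\u008d"

    // Four separate characters, not one emoji.
    println(codePointsOf(parsed)) // [U+00F0, U+009F, U+00A4, U+008D]
}
```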
What didn't work: Reading the JSON as UTF-8 did not help. Android still fails to understand that this is one emoji and not two Unicode letters. Here's the parsing logic; the JSON is read directly from a .json file.
val actualJson = String(jsonInputStream.readBytes(), Charsets.UTF_8)
Why is Android not reading the UTF-8 content correctly?
The workaround that solved this was kind of hacky. So that the Latin-1 step happens first and the UTF-8 decoding comes last, I had to convert the string to a byte array while treating it as a Latin-1 (ISO-8859-1) string rather than UTF-8, and then decode those bytes back as UTF-8. I am not exactly sure why this worked, but it's the only thing that did, and I am glad it did, since I was about to drop the whole thing after wasting hours looking for answers.
val finalString = String(initialString.toByteArray(Charsets.ISO_8859_1), Charsets.UTF_8)
This actually did the trick. No other solution worked, not even the StringEscapeUtils.escapeJava/unescapeJava methods from Apache Commons Text.