json, character-encoding, gson, charset

Special characters appearing in Java code


I am fetching a JSON string as a response and converting it to a JSON object.

In the debugger screenshot, the description String shows a strange ? character on a colored background (it looks like a replacement character). I checked in the debugger, and the problem appears after converting the JSON string to a JsonObject. This is the code (mm is the JSON string):

JsonObject con = getCon(mm);

private JsonObject getCon(String mm) {
    var file = new String(mm.getBytes(), StandardCharsets.UTF_8);
    return new GsonBuilder().create()
            .fromJson(file, JsonObject.class)
            .getAsJsonObject("dict")
            .getAsJsonObject("con");
}

I changed the first line to var file=new String(mm.getBytes("UTF-8"),StandardCharsets.UTF_8);
After this, the description String looks like the last line in the attached image, which is even more confusing. I am not sure what could be going wrong here. The actual String in the JSON is something like Post Approval - Completed, Post Approval - Pending. There are many description attributes in the JSON string, and this happens only for a few of them. How can I debug this further?


Solution

  • Gson operates only on chars, for example a String or a Reader, never on raw bytes. So any encoding issue you encounter most likely occurs before Gson is called.

    new String(mm.getBytes(),StandardCharsets.UTF_8) causes encoding issues because String.getBytes() without arguments uses the platform default charset. Before Java 18 that default depends on your OS and locale, so it is often not UTF-8 and might not even support all Unicode characters. Decoding those bytes as UTF-8 then produces incorrect results. There is normally never a good reason to use String.getBytes() without a Charset parameter; code analysis tools often flag it with a warning. The Policeman's Forbidden API Checker might be useful for you, since it detects usage of error-prone methods like this one.

    Your adjusted code new String(mm.getBytes("UTF-8"),StandardCharsets.UTF_8) is effectively a no-op: you first convert the String to a byte[] using UTF-8 and then reverse that conversion. (The only effect it might have is that unpaired surrogates are replaced.) It therefore cannot repair a String that is already corrupted.
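To make the difference between the two variants concrete, here is a minimal sketch. It assumes ISO-8859-1 as a stand-in for a non-UTF-8 platform default, and the sample text (with an accented letter and an en dash) is made up for illustration; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {

    // Encodes the text with one charset and decodes the bytes as UTF-8,
    // which is what new String(mm.getBytes(...), StandardCharsets.UTF_8) does.
    static String roundTrip(String text, Charset encodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sample: \u00E9 is an accented e, \u2013 is an en dash.
        String original = "Caf\u00E9 \u2013 Pending";

        // Same charset on both sides: the text comes back unchanged.
        String same = roundTrip(original, StandardCharsets.UTF_8);
        System.out.println(same.equals(original));

        // Mismatched charsets (ISO-8859-1 stands in for a platform default):
        // the accented letter is encoded as a single byte that is not valid
        // UTF-8, so it decodes to U+FFFD; the en dash is not representable
        // in ISO-8859-1 at all and is already lost during encoding.
        String broken = roundTrip(original, StandardCharsets.ISO_8859_1);
        System.out.println(broken.equals(original));
        System.out.println(broken.indexOf('\uFFFD') >= 0);
    }
}
```

Only the characters outside ASCII are affected, which would be consistent with the problem showing up in just a few of the description values.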

    To debug this further, check where the value of mm comes from and at which point (if any) it still has the correct value. If you are reading it from a file, make sure you specify the correct encoding when reading. The file may not be UTF-8 encoded; editors such as VS Code and Notepad++ can detect the encoding automatically and show it.
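For the file case, one possible way to check the encoding programmatically is a strict UTF-8 decode, which fails on invalid byte sequences instead of silently replacing them. This is only a sketch: the class name, method names, and sample file contents are hypothetical, and a successful strict decode does not prove the file was intended as UTF-8.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileEncodingCheck {

    // A fresh CharsetDecoder reports malformed input by throwing,
    // unlike new String(bytes, charset), which substitutes U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Reads the file with an explicitly chosen charset, never the platform default.
    static String readJson(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sample", ".json");
        try {
            // Simulate a file that was saved as ISO-8859-1 rather than UTF-8.
            byte[] content = "{\"description\":\"Caf\u00E9\"}".getBytes(StandardCharsets.ISO_8859_1);
            Files.write(file, content);

            System.out.println("valid UTF-8: " + isValidUtf8(Files.readAllBytes(file)));
            System.out.println(readJson(file, StandardCharsets.ISO_8859_1));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```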
    If the value comes from an HTTP response, verify that you respect the charset the server specifies in the Content-Type header. The latest JSON specification (RFC 8259) requires UTF-8 for JSON exchanged between systems, but the server might still be sending a different encoding for whatever reason.
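For the HTTP case, the idea can be sketched as follows, assuming you have access to the raw response bytes and the Content-Type header value. The class and method names are hypothetical, the header parsing is deliberately simplified, and falling back to UTF-8 when no usable charset is given is an assumption of this sketch. Many HTTP clients can do this decoding for you, so check your client's behavior first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ResponseDecoding {

    // Extracts the charset parameter from a Content-Type value such as
    // "application/json; charset=ISO-8859-1". Falls back to UTF-8 when the
    // parameter is missing or names a charset this JVM does not know.
    static Charset charsetFromContentType(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the raw body bytes with the charset the server declared.
    static String decodeBody(byte[] body, String contentType) {
        return new String(body, charsetFromContentType(contentType));
    }

    public static void main(String[] args) {
        // Simulated response body sent as ISO-8859-1.
        byte[] body = "{\"description\":\"Caf\u00E9\"}".getBytes(StandardCharsets.ISO_8859_1);

        // Respecting the declared charset keeps the text intact.
        System.out.println(decodeBody(body, "application/json; charset=ISO-8859-1"));

        // Assuming UTF-8 for the same bytes corrupts the accented letter.
        System.out.println(decodeBody(body, "application/json"));
    }
}
```

The decoded String can then be passed to Gson unchanged, since Gson itself does not deal with bytes.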