java, visual-studio-code, unicode, encoding

Katakana character in Java displaying as '?' in Visual Studio Code


A question mark gets displayed in the terminal in VS Code when I use the hexadecimal value of a Japanese katakana character. I want the actual character to be displayed.

char c='\ua432';

When I print the variable c, the output is:

?

I am a newbie to Java. I use VS Code and I am following 'Java: The Complete Reference' by Herbert Schildt.

The book says:

For hexadecimal, you enter a backslash-u (\u), then exactly four hexadecimal digits. For example, '\u0061' is the ISO-Latin-1 'a' because the top byte is zero. '\ua432' is a Japanese Katakana character.

I surfed SO, however I got my mind clogged with haze. I couldn't find a direct workaround for the problem, and I couldn't understand the solutions I did find. I gathered that the main reason may be that the console I am printing to doesn't understand the character encoding I am writing out, or else that the console font doesn't have a glyph for the character. I have no idea how to solve it.

All I am asking for is a quick and easy solution!

I hope this question is not a dupe; if a solution is already out there on SO, I just want it to be one I can actually implement.


Solution

  • You need some context to understand the answer.

    All I am asking for is a quick and easy solution!

    And I want world peace. We both asked for the impossible. There is a quick and easy solution, but from what you described it is not possible to know what that is, and it is not possible to explain to you what you need to tell us / what you need to do to find that quick and easy solution, without giving you a ton of context.

    The answer is almost certainly 'go into your VSCode preferences/settings and change either an encoding setting or a font setting or both'.

    Fundamentals

    A lot of things in the computer world are based on bytes. For example, connections over the internet carry bytes: say you type www.stackoverflow.com in your browser's toolbar and hit enter - your browser application tells the OS to tell the network system to connect to the internet and open a pipe. Kinda like picking up the telephone, except instead of voice audio being carried over the 'pipe', it's bytes. Files on disk? You read and write bytes to these.

    Crucially, also, applications are considered to have something called 'standard in' and 'standard out' - these are simply the inputs and outputs of the program. By default, if you start applications on the command line, the keyboard is hooked up to standard in, and the terminal (the big black box thing you're typing commands into) is hooked up as standard out. But you can change that; for example, on unixy systems you can write:

    cat foo.txt >/dev/printer
    

    Which will actually print out foo.txt. That's because the cat command means: "Open the stated file and spew the entire contents out to the standard output" and >/dev/printer says: Instead of the default 'the black box thing' as standard output, I want it to be sent to the file '/dev/printer' (which is a device, but in unix systems, they live on the file system too, so they have a path). Assuming things are hooked up right, the printer will spit out a page if you do this.

    The crucial thing to understand is, fundamentally, standard out and standard in are bytes, too!

    However, text and bytes aren't compatible. A byte is a value between 0 and 255, i.e. it can represent at most 256 different unique things. However, there are far more than 256 unique characters. Most of the computing world solves that problem by using unicode, but unicode has about a million symbols (or at least, has room for that many), so clearly we can't use a single byte to represent a unicode character.
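
    To make that concrete, here is a minimal sketch (standard library only; the class name is just made up for the example). It shows that the single katakana character from the question takes three bytes under UTF-8, and that standard out really is a byte stream you can write those raw bytes to - whether you then see ꐲ depends entirely on how the terminal decodes them:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    class BytesVsChars {
      public static void main(String[] args) throws Exception {
        String s = "\ua432";                              // the single character ꐲ
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // char-to-byte conversion: a charset is involved
        System.out.println(s.length());                   // 1  (one character)
        System.out.println(Arrays.toString(utf8));        // [-22, -112, -78]  (three bytes!)
        // Standard out is bytes too: write the raw bytes yourself, and it is up to
        // the terminal to decode them back into a visible character.
        System.out.write(utf8);
        System.out.flush();
      }
    }
    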

    When you write cat foo.txt you're asking cat to send the contents of that text file to the standard output (your screen, by default), but there's a problem here: standard output is a 'stream of bytes', and you don't want to see the bytes, you want to see characters. Thus, the code that runs that black box thing (cmd.exe on windows, your terminal and your shell on unixy systems) is being tasked to interpret the bytes and guess what symbols those might represent.

    In order to do that, there is a charset encoding involved.

    The key takeaway: Anytime bytes turn into characters or vice versa, a charset encoding is applied. To make things 'easy', loads of systems will just take a wild stab in the dark and apply one whenever such conversions occur. cat foo.txt itself would do that, after all: you're not specifying a charset encoding. In fact, cat has no cat --charset-encoding=UTF8 style option, because cat itself can only emit bytes; it's the terminal/cmd.exe/shell that has to do the converting.

    The same rule applies everywhere. If the server that runs stackoverflow.com answers, it has to send bytes, your browser reads these but needs to show text, so it... applies a charset encoding to convert bytes to text. Similarly, when you submitted the form to submit your question, your browser had a bunch of characters, and had to apply a charset encoding to turn these into bytes in order to send it to the stackoverflow.com server.

    Let's make that practical!

    Let's say I open an editor and write, literally:

    class Main { public static void main(String[] args) {
      System.out.println("☃");
    }}
    

    I then hit 'save'.

    This immediately causes some charset conversion to be applied. After all, the editor is a text-based notion, and when I save, that's: Write the contents of what I am writing to a file. Files are byte-based notions, thus, text-to-byte conversion occurs.

    You'll find that every non-idiotic editor will indicate, or otherwise have some option, to tell it what encoding to apply when saving your file.

    I then switch to the command line and type:

    javac Main.java
    

    javac opens the file and parses my source code. Source code is a text based concept, files are a byte based concept, so javac applies a charset encoding to turn the file contents into characters so it can parse them. And, indeed, javac has an -encoding option, as it should.

    Fortunately, java has a built-in, fixed way to store characters in class files; it is the same on every system, so at least here there is no charset choice that we need to worry about.

    We then type:

    java Main
    

    And the java executable dutifully powers up and executes the code in the main method. Which is to print that string (we established java class files have sorted that out for us; not all programming systems do, but java does, fortunately!) - it does that, but System.out is a byte based concept, so to do it, java.exe has to apply char-to-byte conversion, and thus, applies a charset encoding!

    The bytes then arrive at /bin/bash which needs to turn that back into characters, and... applies a charset encoding.

    So, in this ridiculously simple act of writing a 'hello world' java app and running it:

    • Our editor encoded text (our source code)
    • Our javac decoded a file (the file with our source code).
    • Our java executable encoded text (the snowman string)
    • Our terminal setup decoded bytes from the java.exe process's standard out (the snowman string)

    Those last 2 steps seem particularly stupid (we have text, why are we encoding it to bytes just so that the font rendering of the terminal can immediately decode those bytes right back to characters again??) - but, remember, the setup is that standard-in and standard-out can be redirected to printers, files, FTP servers, other applications, you name it.

    Unless each of those 4 steps picked the same encoding, you get mojibake.
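
    If you want to see mojibake happen in a controlled way, here is a small sketch (standard library only; class and variable names made up for the example) that encodes the character with one charset and decodes the resulting bytes with another - exactly the kind of mismatch described above:

    import java.nio.charset.StandardCharsets;

    class MojibakeDemo {
      public static void main(String[] args) {
        String original = "\ua432";
        // One link in the chain encodes the text as UTF-8 bytes...
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        // ...and another link wrongly decodes those bytes as ISO-8859-1.
        String garbled = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(original)); // false - the text did not survive the round trip
        System.out.println(garbled);                  // three junk characters instead of ꐲ
      }
    }
    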

    Hopefully the entire chain all picks UTF-8, which is an encoding that can represent every single one of those ~million unicode symbols that exist. Not all charset encodings can do that. However, in practice, lots of systems (particularly windows) do not default to UTF-8, and mojibake is likely to occur.
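
    If you're curious which encoding your own JVM falls back to, you can simply ask it (a quick check, standard library only; note that on Java 18 and newer the default was changed to UTF-8, so recent JVMs are less likely to surprise you here):

    import java.nio.charset.Charset;

    class DefaultCharsetCheck {
      public static void main(String[] args) {
        // The charset the JVM uses whenever no charset is specified explicitly.
        // On older Windows setups this is often something like windows-1252.
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
      }
    }
    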

    So, what explains the ? you're seeing?

    There are 2 different explanations, and from your question alone it simply isn't clear which of the two applies.

    Explanation 1: Mojibake

    You used the \ua432 escape in your java source file. You could just as easily have written:

    char c = 'ꐲ';
    

    and if your editor and javac's -encoding option are in agreement, this would have produced identical results (literally, completely identical class files). That \u thing is just a way to write such characters if you're not sure your editor's and javac's charsets line up, or if the charset you are using is incapable of storing this character. (Lots of different encodings treat the ASCII symbols - your basic space, a through z, 0 through 9, :, dollar, underscore, dash, etcetera - the same way. So if you take some text that consists solely of ASCII characters, encode it using one encoding, then decode it using a completely different encoding, it often works fine. That still means your process is broken, but you won't realize it until you render non-ASCII stuff. Such as ꐲ - but even ë and € aren't ASCII!)
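
    Here is a quick sketch of that 'ASCII hides the problem' effect (standard library only; the class name and the UTF-8/windows-1252 pairing are just chosen for the demonstration): pure-ASCII text survives an encode/decode mismatch, while ë and € do not:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    class AsciiHidesTheProblem {
      static String roundTrip(String text) {
        // Encode with one charset, decode with a different one - a broken process.
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, Charset.forName("windows-1252"));
      }

      public static void main(String[] args) {
        System.out.println(roundTrip("plain ascii")); // looks fine, so the bug stays hidden
        System.out.println(roundTrip("ë€"));          // comes out mangled
      }
    }
    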

    Thus, you've eliminated any opportunity for a mismatch in the first 2 steps (writing your source file, and having javac compile the source file). That leaves steps 3 and 4.

    The little terminal-esque thingies in code editors are notorious for being crappy at encodings. It's definitely possible that java.exe assumes the encoding the system uses is UTF-8, therefore converts your text via UTF-8 to bytes to send to its standard out, but that VSCode or wherever you are running this actually applies, say, Cp1252 encoding or some such. If that's the case there's nothing you can do: Cp1252 can't represent ꐲ at all, so it's not possible to print it. Perhaps you can find a setting in VSCode to change the charset it uses to decode the bytes to characters (which it has to do to show the characters on your screen in the little terminal box thing in your editor).
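
    You can even ask a charset directly whether it is capable of encoding the character at all - a small sketch, standard library only:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    class CanEncodeCheck {
      public static void main(String[] args) {
        char c = '\ua432';
        // windows-1252 (Cp1252) has no mapping for this character at all...
        System.out.println(Charset.forName("windows-1252").newEncoder().canEncode(c)); // false
        // ...whereas UTF-8 can encode every unicode character.
        System.out.println(StandardCharsets.UTF_8.newEncoder().canEncode(c));          // true
      }
    }
    
    When the encoder used for standard out hits a character it cannot encode, it typically substitutes a replacement - usually ? for single-byte charsets - which would match exactly what you're seeing.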

    ANSWER 1: Check your editor's settings and ensure they line up with your system's default charset encoding. Most likely, just set everything to UTF-8.
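
    If fiddling with settings doesn't pan out, a code-side stopgap is to wrap standard out in a PrintStream with an explicit charset (a sketch, assuming Java 10+ for this PrintStream constructor, and assuming the terminal actually decodes UTF-8):

    import java.io.PrintStream;
    import java.nio.charset.StandardCharsets;

    class ForceUtf8Out {
      public static void main(String[] args) {
        // Explicitly encode as UTF-8 instead of whatever the platform default is.
        // This only helps if the thing reading the bytes (the terminal) decodes UTF-8.
        PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        char c = '\ua432';
        out.println(c);
      }
    }
    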

    Explanation 2: Font limits

    As I mentioned, unicode has room for a million symbols. It doesn't quite have that many, but there are a lot of unicode symbols.

    Unicode just explains which number maps to which symbol. It doesn't actually dictate what things look like. A font does that: A font is a table that maps characters onto an image or vector instructions to actually draw these things on your display.

    Fonts don't necessarily have an image for every character.

    For example, the sentence 'This sentence' rendered in a serif font looks completely different from the same sentence rendered in a sans-serif font: in the serif version the 'i' has serifs (little horizontal bars), in the other it doesn't. The actual characters are identical, but a different font is used, so one font maps the 'i' character to one image and the other font maps it to a different image.

    ANSWER 2: The font used by VSCode in the 'terminal' box simply doesn't have a mapping from U+A432 to any image at all, which means the font rendering engine shows a placeholder 'I have no image for this character' symbol instead - and that placeholder could be ?. To fix this, go into your VSCode settings and change which font is used in the console view.

    Those 'images' are generally called 'glyphs', so this condition is called a 'missing glyph'.

    Usually - and the unicode spec strongly suggests this - font rendering engines should use □ (white square, U+25A1), or possibly � (the replacement character, U+FFFD), for this. A ? would be used only if the font also has no glyph for either of those. Which is certainly possible - various programmer fonts don't have any glyphs for anything outside of the ASCII range. If that is the case, ? will be used instead, which is what you're seeing!
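
    If you want to check explanation 2 from Java itself, AWT can tell you which of your installed fonts actually have a glyph for U+A432 (a sketch; it lists whatever fonts your system reports, which may or may not include the font your VSCode terminal uses):

    import java.awt.Font;
    import java.awt.GraphicsEnvironment;

    class GlyphCheck {
      public static void main(String[] args) {
        char c = '\ua432';
        for (Font f : GraphicsEnvironment.getLocalGraphicsEnvironment().getAllFonts()) {
          if (f.canDisplay(c)) {
            // Any font printed here has a glyph for ꐲ; picking one of these as the
            // terminal font in VSCode's settings would rule out a missing glyph.
            System.out.println(f.getFontName());
          }
        }
      }
    }
    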


    [1] Wellll, there's the whole topic of surrogate pairs and how one can colour an argument to say it really isn't true unicode, but OP specifically is complaining about complexity, and it doesn't apply to the exact symbol OP wants to display, so, let's not get into that.