java linux character-encoding apache-commons

Java byte to String encoding problem on Linux

I am implementing a piece of software that works like this:

I have a Linux server running a vt100 terminal application that outputs text. My program telnets to the server and reads and parses bits of the text into relevant data. The relevant data is sent to a small client, run by a webserver, that outputs the data on an HTML page.

My problem is that certain special characters like "åäö" are output as question marks (classic).

Background:
My program reads a byte stream using Apache Commons TelnetClient. The byte stream is converted into a String, then the relevant bits are substring'ed and put back together with separator characters. After this, the new string is converted back into a byte array and sent over a Socket to the client run by the webserver. This client creates a string from the received bytes and prints it to standard output, which the webserver reads and outputs HTML from.

Step 1: byte[] --> String --> byte[] --> [send to client]

Step 2: byte[] --> String --> [print output]

Problem:
When I run my Java program on Windows, all characters, including "åäö", are output correctly on the resulting HTML page. However, if I run the program on Linux, all special characters are converted into "?" (question mark).

The webserver and the client are currently running on Windows (step 2).

Code:
The program basically works like this:

My program:

byte[] data = telnetClient.readData(); // Assume method works and returns a byte[] array of text.

// I have my reasons to append the characters one at a time using a StringBuffer.
StringBuffer buf = new StringBuffer();
for (byte b : data) {
    buf.append((char) (b & 0xFF));
}

String text = buf.toString();

// ...
// Relevant bits are substring'ed and put back into the String.
// ...

ServerSocket serverSocket = new ServerSocket(...);
Socket socket = serverSocket.accept();
serverSocket.close();

socket.getOutputStream().write(text.getBytes());
socket.getOutputStream().flush();

The client run by webserver:

Socket socket = new Socket(...);

byte[] data = readData(socket); // Assume this reads the bytes correctly.

String output = new String(data);

System.out.println(output);

Assume the synchronizing between the reads and writes works.

Thoughts:
I have tried different ways of encoding and decoding the byte array, with no results. I am a little new to charset encoding issues and would like some pointers. The default charset on Windows, "WINDOWS 1252", seems to let the special characters through all the way from server to webserver, but when the program is run on a Linux computer the default charset is different. I have tried to run a "Charset.defaultCharset().forName()" and it shows that my Linux computer is set to "US-ASCII". I thought that Linux defaulted to "UTF-8"?

What should I do to get my program to work on Linux?

Solution

It's generally a bad idea to rely on the platform default encoding, especially for a network communication protocol.
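To see why the question marks appear, here is a minimal sketch that encodes and decodes the same text with two explicitly chosen charsets. The class and method names are made up for illustration. US-ASCII stands in for the Linux default reported in the question, and ISO-8859-1 stands in for a charset that can represent "åäö". The characters are written as Unicode escapes so the example does not depend on the source file's encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetPitfall {
    // Hypothetical helper: encode the text to bytes and decode it again
    // with the same charset, as a no-argument getBytes()/new String() pair
    // would do with whatever the platform default happens to be.
    static String encodeThenDecode(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        String text = "\u00e5\u00e4\u00f6"; // "åäö"

        // US-ASCII cannot represent these characters, so getBytes()
        // substitutes the charset's replacement byte, which is '?'.
        String viaAscii = encodeThenDecode(text, StandardCharsets.US_ASCII);
        System.out.println(viaAscii.equals("???")); // true

        // ISO-8859-1 can represent them, so the text survives unchanged.
        String viaLatin1 = encodeThenDecode(text, StandardCharsets.ISO_8859_1);
        System.out.println(viaLatin1.equals(text)); // true
    }
}
```

The information is already lost at the getBytes() call, so nothing the receiving side does can bring the original characters back.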

Both new String() and String.getBytes() are overloaded to allow you to specify the encoding. Since you control encoding as well as decoding, simply use UTF-8 (hardcoded).
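As a rough sketch of what that could look like for the two steps in the question: the class and method names below are hypothetical, and the telnet side is assumed to deliver ISO-8859-1 bytes, which is what the (char) (b & 0xFF) loop in the question effectively assumes, since it maps each byte value straight to the code point with the same number.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRoundTrip {
    // Same result as the (char) (b & 0xFF) loop: each byte 0x00..0xFF
    // becomes the character U+0000..U+00FF, i.e. ISO-8859-1 decoding.
    // Assumption: the terminal application really emits ISO-8859-1.
    static String decodeTelnetBytes(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Step 1, sending side: always encode as UTF-8 before writing to the socket.
    static byte[] encodeForWire(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Step 2, receiving side: always decode as UTF-8, whatever the platform default is.
    static String decodeFromWire(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] fromTelnet = {(byte) 0xE5, (byte) 0xE4, (byte) 0xF6}; // "åäö" in ISO-8859-1
        String text = decodeTelnetBytes(fromTelnet);

        byte[] wire = encodeForWire(text);
        String received = decodeFromWire(wire);

        System.out.println(received.equals("\u00e5\u00e4\u00f6")); // true
        System.out.println(wire.length); // 6, two UTF-8 bytes per character
    }
}
```

In the question's code this would amount to replacing text.getBytes() with text.getBytes(StandardCharsets.UTF_8) in the program and new String(data) with new String(data, StandardCharsets.UTF_8) in the client. On Java versions older than 7, the string name "UTF-8" can be passed instead of the StandardCharsets constant.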

Also check your code for uses of FileReader, FileWriter, InputStreamReader and OutputStreamWriter, all of which potentially rely on the platform default encoding. The first two did so exclusively before Java 11 added constructors that take a Charset, which made them pretty useless for portable code.
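For the reader and writer classes, the same idea applies: pass the charset to the constructor instead of using the one-argument form. The following is only a sketch with hypothetical helper names, using in-memory streams in place of the socket streams from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitStreamCharsets {
    // Writes text as UTF-8 regardless of the platform default encoding.
    static void writeText(OutputStream out, String text) throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(text);
        writer.flush(); // push buffered characters through to the byte stream
    }

    // Reads the whole stream as UTF-8 regardless of the platform default encoding.
    static String readText(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-ins for socket.getOutputStream() / getInputStream().
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeText(sink, "\u00e5\u00e4\u00f6");

        String back = readText(new ByteArrayInputStream(sink.toByteArray()));
        System.out.println(back.equals("\u00e5\u00e4\u00f6")); // true
    }
}
```

Note that System.out is also a stream with an encoding of its own. If the webserver expects a particular encoding on the client's standard output, wrapping it as new PrintStream(System.out, true, "UTF-8") is one way to pin that down too. Which encoding the webserver actually expects is not stated in the question, so UTF-8 here is an assumption.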