Some characters get lost while writing to StringBuilder


I'm working with the following Java code:

public InputStream unzip(InputStream inputStream) throws IOException {
    ZipInputStream zipIn = new ZipInputStream(inputStream);
    zipIn.getNextEntry();
    Scanner sc = new Scanner(zipIn);
    StringBuilder sb = new StringBuilder();
    while (sc.hasNextLine()) {
        sb.append(sc.nextLine());
        sb.append("\n");
    }
    System.out.println(sb);
    zipIn.close();
    InputStream is = fromStringBuffer(sb);
    return (InputStream)is;
}

public static InputStream fromStringBuffer(StringBuilder sb) {
    return new ByteArrayInputStream(sb.toString().getBytes());
}

When I unzip the file, some Turkish characters come out garbled (for example, Ü becomes Ãœ).

How can I get them written to the StringBuilder correctly?


Solution

  • Streams (of the java.io variety, as opposed to java.util.stream) are for reading (or writing) bytes.

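    Scanner deals with chars. If you pass an InputStream to a Scanner, you need to provide a charset; otherwise it uses the JVM's default charset, and if that differs from the file's actual encoding the bytes are decoded incorrectly (which is how Ü can turn into Ãœ).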

    But this assumes that the byte stream passed to the Scanner actually represents a stream of chars in some charset. A ZipInputStream does not necessarily do so: it yields whatever the contents of the zipped file are. Since you say characters are getting lost, I presume your zipped file is text; but from the perspective of reading the zip file, it is just a stream of bytes.

    If you want an InputStream from a ZipInputStream, simply return the ZipInputStream.

    If you want to interpret the returned stream as chars, of course you will still need to know the charset; but you just won't have introduced unnecessary round-tripping from bytes to chars to bytes here.

    If you want all of the charset encoding to be handled inside this method, return a Reader, the analogue of InputStream that represents a stream of chars.

    For example, you could return new InputStreamReader(zipIn, charset). This doesn't absolve you of having to know the correct charset, but it spares callers of the method from dealing with it.
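
The two options described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the asker's actual code: the class name UnzipText and the methods unzipBytes, unzipChars and roundTrip are hypothetical, the archive is assumed to contain a single text entry, and UTF-8 is assumed as the file's encoding (which is consistent with Ü showing up as Ãœ, but is not confirmed by the question). The roundTrip helper exists only to make the example runnable without a real zip file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class UnzipText {

    // Option 1: position the stream at the first entry and hand back its bytes untouched.
    static InputStream unzipBytes(InputStream in) throws IOException {
        ZipInputStream zipIn = new ZipInputStream(in);
        if (zipIn.getNextEntry() == null) {
            zipIn.close();
            throw new IOException("zip archive contains no entries");
        }
        return zipIn; // the caller reads from it and closes it
    }

    // Option 2: decode inside the method, with an explicit charset, and hand back chars.
    static Reader unzipChars(InputStream in, Charset charset) throws IOException {
        return new InputStreamReader(unzipBytes(in), charset);
    }

    // Demo helper (hypothetical): zip `text` encoded as `written`,
    // then unzip it and decode it as `read`.
    static String roundTrip(String text, Charset written, Charset read) {
        try {
            ByteArrayOutputStream zipped = new ByteArrayOutputStream();
            try (ZipOutputStream zipOut = new ZipOutputStream(zipped)) {
                zipOut.putNextEntry(new ZipEntry("sample.txt"));
                zipOut.write(text.getBytes(written));
                zipOut.closeEntry();
            }
            StringBuilder sb = new StringBuilder();
            try (Reader reader = unzipChars(
                    new ByteArrayInputStream(zipped.toByteArray()), read)) {
                char[] buffer = new char[1024];
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    sb.append(buffer, 0, n);
                }
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u00DC"; // U+00DC, capital U with diaeresis

        // Same charset on both sides: the character survives.
        String right = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(right.equals(original)); // true

        // Mismatched charset: the two UTF-8 bytes are decoded as two separate chars.
        String wrong = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length()); // 2
    }
}
```

If the method has to keep building a StringBuilder and returning an InputStream, the same principle applies at both conversion points: Scanner has a constructor that takes a charset name, as in new Scanner(zipIn, "UTF-8"), and String.getBytes accepts a Charset, as in getBytes(StandardCharsets.UTF_8). Whichever route is taken, the charset named has to be the one the zipped file was actually written in.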