I am trying to read a UTF-8 file from a zip file and it's turning out to be a major challenge.
Here I zip the String to a byte array to persist to my db.
ByteArrayOutputStream bos = new ByteArrayOutputStream();
ZipOutputStream zo = new ZipOutputStream( bos );
zo.setLevel(9);
BufferedWriter writer = new BufferedWriter(
new OutputStreamWriter(bos, Charset.forName("utf-8"))
);
ZipEntry ze = new ZipEntry("data");
zo.putNextEntry(ze);
zo.write( s.getBytes() );
zo.close();
writer.close();
return bos.toByteArray();
And this is how I read the String back:
ZipInputStream zis = new ZipInputStream( new ByteArrayInputStream(bytes) );
ZipEntry entry = zis.getNextEntry();
byte[] buffer = new byte[2048];
ByteArrayOutputStream bos = new ByteArrayOutputStream();
int size;
while ((size = zis.read(buffer, 0, buffer.length)) != -1) {
bos.write(buffer, 0, size);
}
BufferedReader r = new BufferedReader( new InputStreamReader( new ByteArrayInputStream( bos.toByteArray() ), Charset.forName("utf-8") ) );
StringBuilder b = new StringBuilder();
while (r.ready()) {
b.append( r.readLine() ).append(" ");
}
The String that I get back here has lost the UTF-8 characters!
UPDATE 1: I changed the code around so that I compared the byte array of the original String with the byte array I read back from the zip file and they freaking match! So it's probably how I'm building the String after I have the bytes.
Arrays.equals(converted, orgi)
Your problem is in the writing. Presuming s is a String, you have:
zo.write( s.getBytes() );
But that will convert s to bytes using whatever the platform's default encoding is. You'll want to use UTF-8 for that conversion:
zo.write( s.getBytes("utf-8") );
Your observation that the original bytes are the same as the uncompressed bytes makes sense: the zip round trip preserves the bytes exactly, so the data as originally written is the source of the problem.
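To illustrate why matching bytes do not rule out an encoding problem, here is a minimal sketch. The class name CharsetMismatchDemo and the helper encodeThenDecode are made up for this example, and ISO-8859-1 merely stands in for "some default encoding that is not UTF-8"; your actual platform default may differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original code.
final class CharsetMismatchDemo {

    // Encodes s with one charset and decodes the same bytes with another.
    // The byte copy stands in for the zip/unzip step, which preserves bytes exactly.
    static String encodeThenDecode(String s, Charset writeCharset, Charset readCharset) {
        byte[] written = s.getBytes(writeCharset);
        byte[] readBack = Arrays.copyOf(written, written.length);
        // The bytes are identical at this point, just like in UPDATE 1.
        return new String(readBack, readCharset);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // contains one non-ASCII character

        // Assumed stand-in for a non-UTF-8 default encoding on the writing side.
        String broken = encodeThenDecode(s, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        String intact = encodeThenDecode(s, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println("mismatched charsets preserve the text: " + broken.equals(s));
        System.out.println("matching charsets preserve the text: " + intact.equals(s));
    }
}
```

The byte arrays are equal in both cases; only the case where the writing and reading charsets agree gives back the original String.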
Note that you have the writer stream declared but never actually use it for anything (nor should you in this context, since writing to it would just write uncompressed string data to the same stream bos that your ZipOutputStream writes to). It looks like you may have confused yourself by trying a few different things at once here; you should just get rid of writer.
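Putting both points together, a round trip might look roughly like the sketch below. The class name ZipUtf8RoundTrip and the method names zip and unzip are invented for illustration. It uses StandardCharsets.UTF_8 (Java 7+) instead of the "utf-8" string, which avoids the checked UnsupportedEncodingException, and it decodes the bytes directly rather than going through a BufferedReader, so line breaks are kept as they are instead of being replaced with spaces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of the original code.
final class ZipUtf8RoundTrip {

    // Compresses s into a single zip entry named "data", encoded as UTF-8.
    static byte[] zip(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zo = new ZipOutputStream(bos)) {
            zo.setLevel(9);
            zo.putNextEntry(new ZipEntry("data"));
            // Explicit charset: do not rely on the platform default.
            zo.write(s.getBytes(StandardCharsets.UTF_8));
            zo.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The zip stream is closed here, so the archive is complete.
        return bos.toByteArray();
    }

    // Reads the first entry of the archive and decodes its bytes as UTF-8.
    static String unzip(byte[] bytes) {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(bytes))) {
            if (zis.getNextEntry() == null) {
                throw new IOException("archive contains no entries");
            }
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buffer = new byte[2048];
            int size;
            while ((size = zis.read(buffer, 0, buffer.length)) != -1) {
                bos.write(buffer, 0, size);
            }
            // Same charset as on the writing side.
            return new String(bos.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "h\u00e9llo \u4e16\u754c";
        System.out.println(original.equals(unzip(zip(original))));
    }
}
```

There is no separate writer here at all; the only thing that ever writes to bos is the ZipOutputStream.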