We have the following Java method that compresses files using GZIPOutputStream:
private void archive(Path originalFile) {
    Path tempFile = originalFile.resolveSibling(originalFile.toFile().getName() + TEMPORARY_FILE_EXTENSION);
    Path gzippedFile = originalFile.resolveSibling(originalFile.toFile().getName() + ARCHIVED_FILE_EXTENSION);
    try {
        try (FileInputStream input = new FileInputStream(originalFile.toFile());
             BufferedOutputStream output = new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(tempFile.toFile())))) {
            IOUtils.copy(input, output);
            output.flush();
        }
        Files.move(tempFile, gzippedFile, StandardCopyOption.REPLACE_EXISTING);
        Files.delete(originalFile);
        LOGGER.info("Archived file {} to {}", originalFile, gzippedFile);
    } catch (IOException e) {
        LOGGER.error("Could not archive file {}: " + e.getMessage(), originalFile, e);
    }
    try {
        Files.deleteIfExists(tempFile);
    } catch (IOException e) {
        LOGGER.error("Could not delete temporary file {}: " + e.getMessage(), tempFile, e);
    }
}
The problem is that if we manually decompress the file again:
gzip -d file_name
The resulting decompressed file does not match the original file: both the file size and the total number of lines are smaller. For example, one file went from 33 MB to 32 MB with a loss of 800K lines.
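One way to catch such a mismatch before the original is deleted would be to decompress the archive again and compare it byte by byte with the source file. The following is only a minimal sketch of that idea: the class name GzipRoundTripCheck and the method matchesOriginal are hypothetical and not part of our current code.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the original method.
public class GzipRoundTripCheck {

    /**
     * Returns true if decompressing gzippedFile yields exactly the bytes of originalFile.
     * An IOException may be thrown if the archive is truncated or corrupt.
     */
    static boolean matchesOriginal(Path originalFile, Path gzippedFile) throws IOException {
        try (InputStream expected = new BufferedInputStream(Files.newInputStream(originalFile));
             InputStream actual = new BufferedInputStream(
                     new GZIPInputStream(Files.newInputStream(gzippedFile)))) {
            int a;
            int b;
            do {
                a = expected.read();
                b = actual.read();
                if (a != b) {
                    return false; // a byte differs, or one stream ended before the other
                }
            } while (a != -1);
            return true; // both streams ended at the same position with identical content
        }
    }
}
```

In archive(), such a check could run on tempFile before Files.move and Files.delete, so that the original is kept whenever the comparison fails.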
Could the issue be related to the encoding (EBCDIC) of the files we are compressing? https://en.wikipedia.org/wiki/EBCDIC
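As far as we understand it, GZIPOutputStream works on raw bytes and never interprets them as characters, so the character encoding alone should not change the result. A small in-memory sketch that could be used to confirm this is shown here. The class and method names are hypothetical, and it uses every possible byte value as a stand-in for EBCDIC data rather than a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the original method.
public class GzipByteRoundTrip {

    /** Compresses the given bytes in memory. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(data);
        } // closing the stream writes the gzip trailer
        return buffer.toByteArray();
    }

    /** Decompresses gzip data in memory (readAllBytes requires Java 9 or later). */
    static byte[] gunzip(byte[] compressed) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        // Every possible byte value, so any EBCDIC code point is covered.
        byte[] allByteValues = new byte[256];
        for (int i = 0; i < allByteValues.length; i++) {
            allByteValues[i] = (byte) i;
        }
        byte[] restored = gunzip(gzip(allByteValues));
        System.out.println(Arrays.equals(allByteValues, restored)); // prints true
    }
}
```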
After several tests we have not been able to reproduce the issue, so it must have been caused by insufficient free space on the volume during compression. @SirFartALot, thanks for pointing that out.
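To guard against this in the future, a check of the usable space on the target volume could run before compression starts. This is a rough sketch under our own assumptions: the names DiskSpaceCheck and hasUsableSpace are hypothetical, and using the size of the original file as the required amount is just a conservative guess, since the compressed file is normally smaller. Note that FileStore.getUsableSpace() only returns a hint, not a guarantee.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original method.
public class DiskSpaceCheck {

    /** Returns true if the volume holding the file reports at least requiredBytes usable. */
    static boolean hasUsableSpace(Path file, long requiredBytes) throws IOException {
        FileStore store = Files.getFileStore(file);
        return store.getUsableSpace() >= requiredBytes;
    }

    /** Conservative assumption: reserve as much space as the uncompressed file occupies. */
    static boolean hasRoomToArchive(Path originalFile) throws IOException {
        return hasUsableSpace(originalFile, Files.size(originalFile));
    }
}
```

archive() could call hasRoomToArchive(originalFile) first and skip the file, with a log message, when it returns false.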