Tags: java, tar, checksum, crc32, lz4

Why is my InputStream not reading all bytes of a specified file? (CRC32 archive validation)


I'm attempting to use a CheckedOutputStream/CheckedInputStream pair to calculate a CRC32 checksum for a tar archive, to ensure data integrity on a volatile system that frequently loses power without warning. But in some cases, the method I use to validate the CRC after compression completes returns an incorrect checksum and appears to read fewer bytes from the file than are available. This doesn't happen with every tar archive I create, only with some.

I'm using a multi-layered OutputStream stack to create a TAR archive using LZ4 compression in Java. The compression code is as follows:

/**
 * Create a lz4 tar archive of the specified files
 *
 * @param filesToCompress files to compress
 * @param zipFile         destination archive file
 * @return long array of {crc32value, totalBytesWritten}
 * @throws IOException if io error occurs
 */
public static long[] makeTarArchive(File[] filesToCompress, File zipFile) throws IOException
{
    //Create LZ4 archive
    final CRC32 cksum = new CRC32();
    CountingOutputStream countingOutputStreamRef = null;
    try (OutputStream out = Files.newOutputStream(zipFile.toPath(), StandardOpenOption.CREATE);
        // Used to calculate checksum as bytes are written to the root OutputStream
        CheckedOutputStream checkedOutputStream = new CheckedOutputStream(out, cksum);
        // Used to count all bytes written to the root OutputStream (for troubleshooting)
        CountingOutputStream countingOutputStream = new CountingOutputStream(checkedOutputStream);
        BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(countingOutputStream);
        LZ4FrameOutputStream lz4FrameOutputStream = new LZ4FrameOutputStream(bufferedOutputStream);
        TarArchiveOutputStream zipOut = new TarArchiveOutputStream(lz4FrameOutputStream))
    {
        // Store reference for use after try-with-resources scope closes
        countingOutputStreamRef = countingOutputStream;

        for (File file : filesToCompress)
        {
            final TarArchiveEntry tarArchiveEntry = new TarArchiveEntry(file, file.getName());
            tarArchiveEntry.setSize(file.length()); //Specify the size of this file to be archived
            zipOut.putArchiveEntry(tarArchiveEntry); //Allocate a new archive entry

            //Write file bytes to allocated archive entry space
            try (InputStream in = Files.newInputStream(file.toPath()))
            {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1)
                {
                    zipOut.write(buf, 0, n);
                }
            }

            zipOut.closeArchiveEntry(); //Close entry. This method MUST be called for all file entries that contain data.
        }
    }
    return new long[] {cksum.getValue(), countingOutputStreamRef.getByteCount()};
}

The CRC validation method is as follows:

/**
 * Calculates the crc32 value of a compressed archive, and the total number of bytes read during crc calculation
 *
 * @param archive the archive to check
 * @return long array of {crc32value, totalBytesRead}
 */
@SuppressWarnings("java:S3626") // 'Continue' is present for code readability
private static long[] calculateArchiveCRC(final Path archive)
{
    final CRC32 crc32 = new CRC32();
    CountingInputStream countingInputStreamRef;
    try (InputStream fi = new FileInputStream(archive.toFile());
        CheckedInputStream checkedInputStream = new CheckedInputStream(fi, crc32);
        CountingInputStream countingInputStream = new CountingInputStream(checkedInputStream);
        BufferedInputStream bi = new BufferedInputStream(countingInputStream);
        LZ4FrameInputStream lz4i = new LZ4FrameInputStream(bi);
        TarArchiveInputStream ti = new TarArchiveInputStream(lz4i))
    {
        // Store reference for use after try-with-resources scope closes
        countingInputStreamRef = countingInputStream;
        while (ti.getNextTarEntry() != null)
        {
            continue; // Nothing to do - getNextTarEntry() reads all bytes in the current entry
        }
    }
    catch (IOException ioException)
    {
        LOG.error("Error checking CRC32 value for archive {}", archive, ioException);
        return new long[] {0L, 0L};
    }
    return new long[] {crc32.getValue(), countingInputStreamRef.getByteCount()};
}
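
For context, the two methods are wired together roughly like this (a sketch; the file and archive names are placeholders, not from the real system):

// Hypothetical usage: names are illustrative only
File[] filesToCompress = {new File("data1.db"), new File("data2.db")};
File archiveFile = new File("backup.tar.lz4");

long[] written = makeTarArchive(filesToCompress, archiveFile); // {crc32, totalBytesWritten}
long[] read = calculateArchiveCRC(archiveFile.toPath());       // {crc32, totalBytesRead}

if (written[0] != read[0])
{
    // Mismatch: the software reports the archive as corrupt
}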

I've also tried putting a byte-by-byte read loop inside the calculateArchiveCRC method in place of the 'continue' statement, but the results didn't change.
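
Such a drain loop would look roughly like this, replacing the while loop in calculateArchiveCRC (a sketch, not the exact code I ran):

final byte[] drainBuf = new byte[4096];
while (ti.getNextTarEntry() != null)
{
    int n;
    // Read and discard each entry's contents so every data byte flows
    // through the CountingInputStream/CheckedInputStream layers underneath
    while ((n = ti.read(drainBuf)) != -1)
    {
        // discard
    }
}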

I have an archive file on which I can reproduce the issue repeatedly. When it was created, the returned value indicated that the stream stack had written 301613791 bytes to disk, and checking the size on disk shows the exact same file size. The archive appears to be well formed, with no actual integrity issues. My assumption is that the CRC value returned with that totalBytesWritten value is also correct, since the size on disk matches the totalBytesWritten value.

When I run the calculateArchiveCRC method on the archive, the totalBytesRead value ends up being 301613787 after the method finishes cycling through all the bytes in the file. This doesn't make sense to me, and I guess this is the question I'm asking: why would my InputStream stack read only 301613787 bytes instead of the full 301613791? It's skipping 4 bytes at the end (or somewhere else?). Because those last 4 bytes are skipped, the CRC value does not match and the software reports the archive as corrupt. As I said before, I don't think it is actually corrupt. I can repeatedly compress the same set of files, and this same issue occurs with the archive containing that specific set of files. The code works as expected for most archives; it just seems that with some of them, not all the bytes are read.

I also ran a test where I added one additional file to the beginning of the archive, with the same subset of other files still in it, and the archive CRC worked out fine. The issue appears to occur only when I compress certain combinations of files, the files in question being mostly SQLite database files.

Is this some kind of InputStream implementation-specific caveat? I.e., is the TarArchiveInputStream or LZ4FrameInputStream implementation responsible for skipping these last 4 bytes?

Is my stream stack in the wrong order for the calculateArchiveCRC method?

What could be going on here?


Solution

  • I don't know the details of how tar files are read so, maybe, the implementation of TarArchiveInputStream or LZ4FrameInputStream makes use of public long skip(long n) to jump ahead, or stops reading the tar file before its end.

    All of this is very implementation specific.

    My suggestion would be to drop the decompression step from your check and just read the BufferedInputStream bi fully, as sketched below.
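
    A minimal sketch of that approach, reusing the java.util.zip CRC32/CheckedInputStream classes from the question but dropping the LZ4 and tar layers (the byte count is kept in a plain long instead of a CountingInputStream):

    private static long[] calculateArchiveCRC(final Path archive)
    {
        final CRC32 crc32 = new CRC32();
        long totalBytesRead = 0L;
        try (InputStream fi = Files.newInputStream(archive);
            CheckedInputStream checkedInputStream = new CheckedInputStream(fi, crc32);
            BufferedInputStream bi = new BufferedInputStream(checkedInputStream))
        {
            // Read the raw compressed bytes; every byte on disk passes through
            // the CheckedInputStream, so nothing can be skipped by higher layers
            final byte[] buf = new byte[8192];
            int n;
            while ((n = bi.read(buf)) != -1)
            {
                totalBytesRead += n;
            }
        }
        catch (IOException ioException)
        {
            LOG.error("Error checking CRC32 value for archive {}", archive, ioException);
            return new long[] {0L, 0L};
        }
        return new long[] {crc32.getValue(), totalBytesRead};
    }

    This checksums exactly the bytes the compression side wrote, so any skip() or early-stop behavior in the decompressing streams no longer matters.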