Search code examples
javaapache-poitestngsha256

POI XSSF / XLSX hashing indeterminism with MessageDigest SHA-256


There seems to be a problem with getting deterministic hash values for the POI XLSX format, with MessageDigest SHA-256 implementation, even for empty ByteArray streams. This happens randomly, after several hundreds or even only thousands of iterations.

The relevant code snippets used to reproduce the problem:

// TestNG FileTest:
@Test(enabled = true) // indeterminism at random iterations, such as 400 or 1290
public void emptyXLSXTest() throws IOException, NoSuchAlgorithmException {
    final Hasher hasher = new HasherImpl();
    boolean differentSHA256Hash = false;
    for (int i = 0; i < 10000; i++) {
        final ByteArrayOutputStream excelAdHoc1 = BusinessPlanInMemory.getEmptyExcel("xlsx");
        final ByteArrayOutputStream excelAdHoc2 = BusinessPlanInMemory.getEmptyExcel("xlsx");

        byte[] expectedByteArray = excelAdHoc1.toByteArray();
String expectedSha256 = hasher.sha256(expectedByteArray);
byte[] actualByteArray = excelAdHoc2.toByteArray();
String actualSha256 = hasher.sha256(actualByteArray);

if (!expectedSha256.equals(actualSha256)) {
            differentSHA256Hash = true;
            System.out.println("ITERATION: " + i);
            System.out.println("EXPECTED HASH: " + expectedSha256);
            System.out.println("ACTUAL HASH: " + actualSha256);
            break;
        }
    }
    Assert.assertTrue(differentSHA256Hash, "Indeterminism did not occur");
}

Referenced Hasher and POI code:

// HasherImpl class:
public String sha256(final InputStream stream) throws IOException, NoSuchAlgorithmException {
    final MessageDigest digest = MessageDigest.getInstance("SHA-256");
    final byte[] bytesBuffer = new byte[300000]; 
    int bytesRead = -1;
    while ((bytesRead = stream.read(bytesBuffer)) != -1) {
        digest.update(bytesBuffer, 0, bytesRead);
    }
    final byte[] hashedBytes = digest.digest();
    return bytesToHex(hashedBytes);
}

Tried to eliminate indeterminism due to meta data like creation time, to no avail:

// POI BusinessPlanInMemory helper class:
public static ByteArrayOutputStream getEmptyExcel(final String fileextension) throws IOException {
    Workbook wb;

    if (fileextension.equals("xls")) {
        wb = new HSSFWorkbook();
    }
    else {
        wb = new XSSFWorkbook();
        final POIXMLProperties props = ((XSSFWorkbook) wb).getProperties();
        final POIXMLProperties.CoreProperties coreProp = props.getCoreProperties();
        coreProp.setCreated("");
        coreProp.setIdentifier("1");
        coreProp.setModified("");
    }

    wb.createSheet();

    final ByteArrayOutputStream excelStream = new ByteArrayOutputStream();
    wb.write(excelStream);
    wb.close();
    return excelStream;
}

The HSSF / XLS format seems not to be affected by the problem described. Does anybody have a clue, what could be causing this, if not a bug in POI itself? Basically, the code above refers to https://poi.apache.org/spreadsheet/examples.htmlBusinessPlan example

Thanks for your input!


Solution

  • Ok, I've found a way to reset all XSLX file entry file time attributes, according to some example found here at SO. Unfortunately, only file entries seem to be accessible by methods like via ZipFile or OPCPackage. I could not find a solution to also access and reset the folders inside the archive, which also have differing time attributes.

    So far, I didn't succeed in eliminating the differing properties of the POI-generated XLSX-archives, to get same SHA256 hashes out of two otherwise identical files, for which the differing attributes seem to be the reason.

    private void resetOPCPTimeAttributes(File file)
            throws InvalidFormatException, IOException, OpenXML4JException, XmlException {
    
        OPCPackage opcp = ZipPackage.open(file);
        resetZipfileContentTimeAttributes(opcp.getParts());
    
        opcp.flush();
        opcp.close();
    }
    
    private void resetZipfileContentTimeAttributes(List<PackagePart> parts) throws InvalidFormatException {
    
        ArrayList<PackagePart> subParts = null;
        for (PackagePart part: parts) {
    
            PackageProperties props = part.getPackage().getPackageProperties();
            props.setLastModifiedByProperty("");
            props.setCreatedProperty("");
            props.setModifiedProperty("");
    
            subParts = part.getPackage().getParts();
    
            while (subParts != null) {
                resetZipfileContentTimeAttributes(subParts);
            }
        }
    }
    

    Edit:

    In the meantime (until I or someone else finds a solution for manipulating folder meta data inside the Zip archive), I've switched to the deep comparing solution here: Comparing XLSX files