I have two Word documents that I am trying to compare in Java. I tried using an MD5 hash:
HashCode newFile = Files.asByteSource(newFileInput).hash(Hashing.md5());
HashCode oldFile = Files.asByteSource(oldFileInput).hash(Hashing.md5());
and also:
boolean isEqual = FileUtils.contentEquals(oldFileInput, newFileInput);
Even though the contents are the same (I compared them using online tools and Beyond Compare), both methods above report a MISMATCH.
Are there any solutions, or a way to compare files of any type using some Java API? I need to do a deep comparison of two Word files, including spaces, fonts, content, etc.
Expected result: both files should match.
There were multiple suggestions and answers to my question; thanks for those. The reason for the mismatch between the docx files lies in the metadata: every time a doc/docx file is created, its timestamps change. I tried setting the timestamps (accessed, modified and created) of both files to the same values before comparing, but that didn't work. The reason is that apart from these timestamps, there is a piece of metadata called Zip Modify Date, which isn't visible in the file properties. I found this timestamp to be one of the reasons for the hash mismatch. The Base64-encoded strings also differed because of the zip timestamp.
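To see this kind of hidden timestamp for yourself, here is a minimal sketch that lists the modification time stored for each entry inside a docx, using only java.util.zip. It assumes that the "Zip Modify Date" mentioned above corresponds to these per-entry times in the zip container. The class and method names (ZipTimestampLister, entryTimes) are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipTimestampLister {

    // Hypothetical helper: maps each entry name in the archive to its stored
    // last-modification time in epoch milliseconds (-1 if none is recorded).
    static Map<String, Long> entryTimes(Path docx) throws IOException {
        Map<String, Long> times = new TreeMap<>();
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                times.put(entry.getName(), entry.getTime());
            }
        }
        return times;
    }

    // Usage: java ZipTimestampLister old.docx new.docx
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg);
            entryTimes(Paths.get(arg)).forEach((name, millis) ->
                    System.out.println("  " + name + " -> " + Instant.ofEpochMilli(millis)));
        }
    }
}
```

If the two documents show different times for the same entry names, a hash over the raw file bytes will differ even when every part has identical content.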
So the options I had for the comparison were:
1. Convert the docx file to an XML file.
2. Treat the docx file as a zip archive, unzip it, iterate through all the XML files to compute their hashes, and compare the hashes (suggested in one of the answers).
Option 2 was good, but it required a lot of iteration, and unzipping would create many folders.
Option 1 was straightforward: I tried it with an external library, docx4j, which converted the docx to XML, and then the hashes matched. It worked.
Convert DOCX to XML file
I had to try different options because I was looking for the simplest, least complex way to compare the content and styles of Word documents.
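For anyone who prefers option 2, it does not strictly need anything extracted to disk. Below is a rough sketch that reads each entry of the docx in memory with java.util.zip, hashes the uncompressed bytes of every part, and compares the two name-to-hash maps. It is not what I ended up using. The names DocxPartComparer, digestParts and samePartContents are invented for this example, and SHA-256 is used here in place of the MD5 from my question (either works for this purpose).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DocxPartComparer {

    // Hypothetical helper: maps each entry name in the archive to a hex
    // SHA-256 digest of that entry's uncompressed bytes. Zip-level metadata
    // such as entry timestamps is deliberately not part of the digest.
    static Map<String, String> digestParts(Path docx)
            throws IOException, NoSuchAlgorithmException {
        Map<String, String> digests = new TreeMap<>();
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                MessageDigest md = MessageDigest.getInstance("SHA-256");
                try (InputStream in = zip.getInputStream(entry)) {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        md.update(buf, 0, n);
                    }
                }
                StringBuilder hex = new StringBuilder();
                for (byte b : md.digest()) {
                    hex.append(String.format("%02x", b));
                }
                digests.put(entry.getName(), hex.toString());
            }
        }
        return digests;
    }

    // True when both archives contain the same entry names with identical contents.
    static boolean samePartContents(Path a, Path b)
            throws IOException, NoSuchAlgorithmException {
        return digestParts(a).equals(digestParts(b));
    }
}
```

One caveat: some parts can still differ between otherwise identical documents. For example, docProps/core.xml stores created and modified dates inside the XML itself, so you may want to drop such entries from the maps before comparing.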