I am working on a document management system, and to detect changed or duplicate files I am using SHA-256 digests for comparison. This is being done in Python. The system can be configured to encrypt the files before storage.
The question is whether it is still safe to store the digest for the unencrypted file.
This digest is used as an identifier for the stored files and is also used to detect whether a file being added to the system already exists. I am okay with the chance of a SHA-256 collision for this purpose. I have also read that the digest produced by SHA-256 cannot be used to recreate the original data.
Assuming the file cannot be reconstructed from the hash, and given that the file is stored in encrypted form, it should be safe to keep the original hash for comparisons/searching, right? Or should I rethink my strategy? These comparisons are going to be internal to the application and will not be exposed to the user in any way.
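A minimal sketch of this kind of digest-based lookup, assuming an in-memory dict stands in for whatever index the real system uses. The names `file_digest`, `DigestIndex` and `CHUNK_SIZE` are hypothetical and only for illustration; the digest is taken over the unencrypted bytes, before any encryption step.

```python
import hashlib

# Hypothetical chunk size: read the file in pieces so large files
# do not have to fit in memory.
CHUNK_SIZE = 1024 * 1024


def file_digest(path):
    """Return the SHA-256 hex digest of the file's (unencrypted) contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            h.update(chunk)
    return h.hexdigest()


class DigestIndex:
    """Hypothetical in-memory index mapping digest -> first path seen."""

    def __init__(self):
        self._by_digest = {}

    def add(self, path):
        """Register a file; return (digest, is_new).

        is_new is False when a file with the same contents was already added.
        """
        digest = file_digest(path)
        is_new = digest not in self._by_digest
        if is_new:
            self._by_digest[digest] = str(path)
        return digest, is_new
```

Calling `DigestIndex().add(path)` twice with two files that have identical contents would return the same digest both times, with `is_new` set to `False` on the second call.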
A generic preimage attack on SHA-256 takes on the order of 2^256 hash evaluations, and a generic collision attack about 2^128 (brief summary). On the other hand, you can simply check the number of combinations needed to guess the key that decrypts the file. The complexity of a SHA-256 preimage attack is comparable to brute-forcing a 256-bit key for symmetric encryption. So, in general, I'd say this approach is secure enough, because it's easier to restore the original file by guessing the key than by finding a preimage for the SHA-256 digest.
It would be good to know which algorithm and parameters you're going to use for file encryption; the answer might be different in your case.
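To put those numbers side by side, here is a rough back-of-the-envelope calculation. The key size (`KEY_BITS`) and the attacker speed (`GUESSES_PER_SECOND`) are assumptions chosen purely for illustration, not properties of any particular setup.

```python
# Generic attack costs for SHA-256, in hash evaluations.
PREIMAGE_WORK = 2 ** 256
COLLISION_WORK = 2 ** 128

# Assumption: the files are encrypted with a 256-bit symmetric key.
KEY_BITS = 256
KEY_SEARCH_WORK = 2 ** KEY_BITS

# Assumption: a hypothetical attacker testing 10^12 guesses per second.
GUESSES_PER_SECOND = 10 ** 12
SECONDS_PER_YEAR = 365 * 24 * 3600


def years_to_exhaust(work):
    """Whole years needed to try `work` candidates at the assumed rate."""
    return work // (GUESSES_PER_SECOND * SECONDS_PER_YEAR)


# With a 256-bit key, exhaustive key search and a generic preimage
# search cost the same number of trials.
print(PREIMAGE_WORK == KEY_SEARCH_WORK)
print(years_to_exhaust(COLLISION_WORK))
print(years_to_exhaust(PREIMAGE_WORK))
```

With a shorter key, for example 128 bits, exhaustive key search would be the cheaper of the two routes, which is why the encryption parameters matter.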