Search code examples
backupchecksumcrc32integrity

SFV/CRC32 checksum good and fast enough to check for common backup files?


I have 3 terabytes, more than 300,000 reference files of all sizes (20, 30, 40, 200 megas each) and I usually back them up regularly (not zipped). A few months ago, I lost some files probably due to data degradation (as I did "backup" of damaged files without notice).

I do not care about security, so do not need MD5, SHA, etc. I just want to be assured that the files I'm copying are good (the same bits and bytes) and verify that backups are intact after a few months before making backups again.

Therefore, my needs are basic because the files are not very important and there is no need for security (no sensitive information). My doubt: the format/method "SFV CRC/32" is good and fast for my needs? There is something better and faster than that? I'm using the program ExactFile.

Are there any checksum faster than SFV/CRC32 but that is not flawed? I tried using the MD5 but it is slow and since I do not need data security, I preferred the SFV/CRC32. Still, it's painful, because there are more than 300,000 files and takes hours to make the checksum of all of them, even with CPU xeon 8 cores HT and fast HDD.

From the point of view of data integrity , there is some advantage in joining all the files in one .ZIP or .RAR instead of letting them " loose " in folders and files?

Some tips?

Thanks!


Solution

  • If you could quantify "few" and "some" in "A few months ago, I lost some files" (where "few" would be considered to be replaced with "every few" in order to get a rate), then you could calculate the probability of a false positive. However just from those words, I would say, yes, a 32-bit CRC should be fine for your application.

    As for speed, if you have a recent Intel processor, you likely have a CRC-32C instruction, which can make the calculation much faster, by about a factor of 15. (See this answer for some code.) That could be made faster still by running it over multiple cores. If done right, you should be limited by the I/O, not the calculation.

    There is no advantage in this case to bundling them in a zip or rar. In fact it may be worse, if a corruption of that one file causes you to lose everything.