I have to archive a large amount of data off of CDs and DVDs, and I thought it was an interesting problem that people might have useful input on. Here's the setup:
Right now, I was planning on using shutil.copytree to dump the files into a directory, and compare file trees and sizes. Maybe throw in a quick hash, although that will probably slow things down too much.
So my specific questions are:
When you read file by file, you're seeking randomly around the disc, which is a lot slower than a bulk transfer of contiguous data. And, since the fastest CD drives are several dozen times slower than the slowest hard drives (and that's not even counting the speed hit for doing multiple reads on each bad sector for error correction), you want to get the data off the CD as soon as possible.
Also, of course, having an archive as a .iso file or similar means that, if you improve your software later, you can re-scan the filesystem without needing to dig out the CD again (which may have further degraded in storage).
Meanwhile, trying to recovering damaged CDs, and damaged filesystems, is a lot more complicated than you'd expect.
So, here's what I'd do:
Block-copy the discs directly to .iso files (whether in Python, or with dd
), and log all the ones that fail.
Hash the .iso files, not the filesystems. If you really need to hash the filesystems, keep in mind that the common optimization of compression the data before hashing (that is, tar czf - | shasum
instead of just tar cf - | shasum
) usually slows things down, even for easily-compressable data—but you might as well test it both ways on a couple discs. If you need your verification to be legally useful you may have to use a timestamped signature provided by an online service, instead, in which case compressing probably will be worthwhile.
For each successful .iso file, mount it and use basic file copy operations (whether in Python, or with standard Unix tools), and again log all the ones that fail.
Get a free or commercial CD recovery tool like IsoBuster (not an endorsement, just the first one that came up in a search, although I have used it successfully before) and use it to manually recover all of the damaged discs.
You can do a lot of this work in parallel—when each block copy finishes, kick off the filesystem dump in the background while you're block-copying the next drive.
Finally, if you've got 1500 discs to recover, you might want to invest in a DVD jukebox or auto-loader. I'm guessing new ones are still pretty expensive, but there must be people out there selling older ones for a lot cheaper. (From a quick search online, the first thing that came up was $2500 new and $240 used…)