
Transaction-based archive container


I'm creating a Windows service that runs when a specific USB key is plugged in. What it does is simple: contact an FTP server, download some files, and store them in an (encrypted) archive on the USB. The archive can be opened read-only with a tool provided to the client (but that's irrelevant to my problem).

The service is used to keep the USB in sync with the master server (pretty much like Dropbox, but only download and the synchronized folders are on the removable media). The archive can grow up to a few gigabytes. About 1GB of the files are updated every week on the keys of around 400 users.

Since the entire update process is transparent to the user, there is a non-negligible chance that they will unplug the USB while data is being written to the archive (even if I display some kind of screaming, flashy warning: DO NOT UNPLUG). A corrupted archive would have to be downloaded again in its entirety, which means quite a lot of bandwidth wasted on the already loaded servers.

So basically I need writes to the archive to be transacted. It's OK if they fail, as long as they do not leave the container in an inconsistent state. Either the file is entirely written, or it is not. It's OK if the file is partially written, as long as the container does not actually "see" it.

The question is: how can I guarantee data consistency at all times? Specifically, how do you make IO operations work as transactions? What would you suggest? Should I implement something on my own? Or are there already containers that offer this functionality?

This is what I've got so far:

  • Create a new archive, rename on commit: not possible, the archive is too large.
  • Zip / Tar / 7z: unsuitable, a failed write will corrupt the archive
  • Truecrypt: unsuitable, as it requires a file system driver (and therefore Administrator privileges that the users do not have).
  • Anything that requires mapping a file system to a file: unsuitable, pretty sure you can't do that without being Administrator, but if it's possible, it'd be great.
  • Storing files in a SQLite DB: it's ACID, so that could indeed be a solution. However, it would require splitting the files, as SQLite has a limited BLOB capacity (about 1 GB per value by default). Not very elegant, but I'm ready to go that way. Also, SQLite's transaction journal can get pretty large when storing large blobs.
  • Implement that on my own: I'd rather avoid that as much as possible but I'm not afraid to do it. I just find the topic pretty complex.
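To make the SQLite option from the list above concrete, here is a minimal sketch of storing files as chunked BLOBs, where adding a file is a single transaction: either the file row and all of its chunks are committed, or the rollback journal restores the previous consistent state after an unplug. The table names, schema, and the 1 MiB chunk size are illustrative assumptions, not something from the question.

```python
import sqlite3

CHUNK = 1024 * 1024  # split payloads so each BLOB stays well under SQLite's limits

def open_store(path: str) -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS files(
            id INTEGER PRIMARY KEY, name TEXT UNIQUE, size INTEGER);
        CREATE TABLE IF NOT EXISTS chunks(
            file_id INTEGER, seq INTEGER, data BLOB,
            PRIMARY KEY(file_id, seq));
    """)
    return db

def add_file(db: sqlite3.Connection, name: str, payload: bytes) -> None:
    # "with db" wraps everything in one transaction: it commits on success
    # and rolls back if any statement fails (or the process dies mid-write).
    with db:
        cur = db.execute("INSERT INTO files(name, size) VALUES(?, ?)",
                         (name, len(payload)))
        fid = cur.lastrowid
        for off in range(0, len(payload), CHUNK):
            db.execute("INSERT INTO chunks(file_id, seq, data) VALUES(?, ?, ?)",
                       (fid, off // CHUNK, payload[off:off + CHUNK]))

def read_file(db: sqlite3.Connection, name: str) -> bytes:
    rows = db.execute("""SELECT data FROM chunks
                         JOIN files ON files.id = chunks.file_id
                         WHERE files.name = ? ORDER BY seq""", (name,))
    return b"".join(r[0] for r in rows)
```

A reader that only follows the `files` table never "sees" a partially written file, which is exactly the consistency property asked for; the journal size concern from the bullet above still applies.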

If this question is too general please move it to SU or something.


Solution

  • You may want to try using something like svn or git to download encrypted differences; they can typically be used to reconstruct a file locally if it gets corrupted. Or just download diffs and use patch to generate the latest file version.

    You have other problems if the user unplugs a flash drive while it's in the process of writing data. Many drives are unreliable at the flash block level (not just the file system level) and can be corrupted to the point that even a journaling file system like NTFS or ext3 cannot recover. There's more detail here: https://superuser.com/questions/290060/can-flash-memory-be-physically-damaged-if-power-is-interrupted-while-writing
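Whatever container format is chosen, the individual writes can at least be made crash-safe with a stage-then-publish pattern: write to a temporary name, force the data to the device, then rename into place in one step. This is a generic sketch, not the asker's rejected "rename the whole archive" option; it applies to small pieces such as per-file staging or downloaded diffs. The `.tmp` suffix is an illustrative assumption, and note that the rename is only guaranteed atomic on POSIX file systems, so treat it as best-effort on FAT-formatted USB keys.

```python
import os

def commit_write(path: str, payload: bytes) -> None:
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())   # push the data past OS caches to the device
    os.replace(tmp, path)      # publish: readers see either the old file or the new one
```

If power is lost before the `os.replace`, only an orphaned `.tmp` file is left behind, which a startup sweep can delete; the previously committed version stays intact.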