Keeping track of dirty blocks on a block device

I'm looking for a way to keep track of which blocks on a block device have been modified after a given point in time. Eventually I want to use this to keep two 2TB disks in sync, one of which only comes online (connected through USB) once a month. Without knowing which blocks have been modified, I have to go through the whole 2TB every time.

I'm using a recent GNU/Linux OS and have C and Python experience. I'm hoping to avoid writing kernel-level code, as I have no experience in that area whatsoever. My current theory is that there should be a hook somewhere that lets my code get called when a disk flush is performed.

Any ideas?

Solution

It should be possible to use Linux MD (software RAID) for this. Set up a 2-disk RAID 1 array in which one member is missing by default. Every month or so, add the USB disk as the second member and let MD synchronize the changed blocks. A write-intent bitmap is what makes this efficient, so make sure the array has one.

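# Creation: degraded 2-disk RAID 1 (second member "missing"), v1.0 metadata, internal write-intent bitmap
mdadm -C /dev/md0 -l 1 -n 2 -e 1.0 -b internal /dev/sda missing

# Addition of the USB disk as the second member
mdadm /dev/md0 -a /dev/thatusbthing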
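
The recurring monthly cycle could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: the function names, the `MDADM` override variable and the device paths are placeholders of mine, while `--re-add`, `--wait`, `--fail` and `--remove` are standard mdadm options. It assumes the USB disk has already been a full member once (the first `-a` does a complete sync) and that it is failed and removed before being unplugged, so the bitmap keeps recording the writes it misses.

```shell
#!/bin/sh
# Sketch only. MDADM can be overridden (e.g. MDADM=echo) for a dry run.
MDADM=${MDADM:-mdadm}

# Bring the USB disk back into the array. With a write-intent bitmap,
# md should only need to copy the chunks marked dirty since removal.
attach_backup() {   # usage: attach_backup /dev/md0 /dev/thatusbthing
    "$MDADM" "$1" --re-add "$2" || return 1
    # Block until the resync finishes. --wait exits non-zero when there
    # was nothing left to wait for, so that case is not treated as an error.
    "$MDADM" --wait "$1" || true
}

# Detach the USB disk cleanly before unplugging it.
detach_backup() {   # usage: detach_backup /dev/md0 /dev/thatusbthing
    "$MDADM" "$1" --fail "$2" &&
    "$MDADM" "$1" --remove "$2"
}
```

Running with `MDADM=echo` set prints the mdadm argument lists instead of executing them, which is a cheap way to check the device names before touching a real array.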

See also a longer description of this setup, with more discussion of options/potential pitfalls.

Addendum:

rsync was designed to transfer files over a (comparatively slow) network. That means both sides scan their device locally, compute a rolling checksum, and then transfer only the chunks that changed. The list of changes therefore depends on calculating the checksums over everything first. (Reading at 30+ MB/s from a disk is faster than unconditionally pushing at, say, 10 MB/s over a 100 Mbit/s network.)

With MD write-intent bitmaps, the scan stage is not necessary: the bitmap already records which blocks have changed since the disks were last synchronized.
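
To put rough numbers on the difference, the sketch below compares a full pass over 2 TB with a resync limited to dirty bitmap chunks. The 30 MB/s rate is the figure used above; the dirty-chunk count and the 64 MiB chunk size are made-up illustration values (the real ones are reported by `mdadm --examine-bitmap` on a member device).

```shell
#!/bin/sh
# copy_seconds BYTES BYTES_PER_SECOND -> whole seconds (integer division)
copy_seconds() {
    echo $(( $1 / $2 ))
}

RATE=$(( 30 * 1000 * 1000 ))                 # 30 MB/s, as in the text
FULL=$(( 2 * 1000 * 1000 * 1000 * 1000 ))    # the whole 2 TB disk
# Hypothetical month of changes: 500 dirty chunks of 64 MiB each
DIRTY=$(( 500 * 64 * 1024 * 1024 ))

echo "full scan:  $(copy_seconds "$FULL" "$RATE") s"
echo "dirty only: $(copy_seconds "$DIRTY" "$RATE") s"
```

Under these assumptions the full scan takes 66666 s (about 18.5 hours), while copying only the dirty chunks takes 1118 s (under 20 minutes).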