
Syncing files on HDFS that have the same size but differ in content


I am trying to sync files from one Hadoop cluster to another using DistCp and Airbnb's ReAir utility, but neither works as expected.

If a file has the same size on the source and the destination, both tools fail to update it even when its contents have changed (the checksums differ too), unless the overwrite option is used.

I need to keep around 30 TB of data in sync, so reloading the complete dataset every time is not feasible.

Could anyone please suggest how I can bring the two datasets into sync when file sizes are the same (the record count on the source has changed) but the checksums differ?


Solution

The way DistCp handles syncing between files that are the same size but have different contents is by comparing their so-called FileChecksum. The FileChecksum was first introduced in HADOOP-3981, mostly for the purpose of being used in DistCp. Unfortunately, this has the known shortcoming of being incompatible between different storage implementations, and even incompatible between HDFS instances that have different internal block/chunk settings. Specifically, the FileChecksum bakes in the structure of having, for example, 512 bytes per chunk and 128 MB per block.
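You can observe the incompatibility directly by fetching the FileChecksum from each cluster yourself; `FileSystem.getFileChecksum` is the same call DistCp relies on. A minimal sketch, assuming hypothetical NameNode URIs and file path:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumCompare {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster URIs and file path; substitute your own.
        FileSystem srcFs = FileSystem.get(new URI("hdfs://source-nn:8020"), conf);
        FileSystem dstFs = FileSystem.get(new URI("hdfs://dest-nn:8020"), conf);
        Path file = new Path("/data/part-00000");

        // getFileChecksum may return null on filesystems that don't support it.
        FileChecksum srcSum = srcFs.getFileChecksum(file);
        FileChecksum dstSum = dstFs.getFileChecksum(file);

        System.out.println("source: " + srcSum);
        System.out.println("dest:   " + dstSum);
        System.out.println("match:  " + (srcSum != null && srcSum.equals(dstSum)));
    }
}
```

On two HDFS clusters whose dfs.blocksize or bytes-per-checksum settings differ, the two values will differ even for byte-identical files, which is exactly why a size-plus-checksum sync gets confused.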

Since GCS doesn't have the same notions of "chunks" or "blocks", there's no way for it to have any similar definition of a FileChecksum. The same is also true of all other object stores commonly used with Hadoop; the DistCp documentation appendix discusses this fact under "DistCp and Object Stores".
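In practice this usually surfaces as `getFileChecksum` returning null on the object-store side, so DistCp has no checksum to compare at all. A small probe sketch (the GCS path is illustrative, and connector behavior varies by version and configuration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumProbe {
    public static void main(String[] args) throws Exception {
        // Illustrative GCS path; requires the GCS connector on the classpath.
        Path p = new Path("gs://my-bucket/data/part-00000");
        FileSystem fs = p.getFileSystem(new Configuration());

        // FileSystem's default implementation returns null, and object-store
        // connectors generally do too, leaving DistCp -update with only file
        // size and modification time to go on.
        FileChecksum sum = fs.getFileChecksum(p);
        System.out.println(sum == null ? "no checksum exposed" : sum.toString());
    }
}
```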

That said, there's a neat trick that can be done to define a nice standardized representation of a composite CRC for HDFS files that is mostly in-place compatible with existing HDFS deployments; I've filed HDFS-13056 with a proof of concept to try to get this added upstream, after which it should be possible to make it work out-of-the-box against GCS, since GCS also supports file-level CRC32C.
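Until something like that composite CRC lands, one workable (if brute-force) stopgap is to compute a whole-file CRC32C yourself by streaming each replica's bytes: the result is storage-independent, at the cost of reading every byte. A sketch, assuming hypothetical URIs and Java 9+ for java.util.zip.CRC32C:

```java
import java.io.InputStream;
import java.net.URI;
import java.util.zip.CRC32C;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamingCrcCompare {
    // Computes a whole-file CRC32C by streaming the file's bytes.
    static long crc32c(FileSystem fs, Path p) throws Exception {
        CRC32C crc = new CRC32C(); // java.util.zip.CRC32C requires Java 9+
        byte[] buf = new byte[1 << 20];
        try (InputStream in = fs.open(p)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                crc.update(buf, 0, n);
            }
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical URIs and path; substitute your own clusters and files.
        FileSystem src = FileSystem.get(new URI("hdfs://source-nn:8020"), conf);
        FileSystem dst = FileSystem.get(new URI("gs://my-bucket"), conf);
        Path file = new Path("/data/part-00000");
        System.out.println("in sync: " + (crc32c(src, file) == crc32c(dst, file)));
    }
}
```

For a 30 TB dataset you would only do this for candidate files (same size on both sides) and ideally cache the result per file, so each side is read at most once per change.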