
How do I incrementally migrate HDFS data using the DistCp tool in Alibaba E-MapReduce?


I am trying to migrate HDFS data using the DistCp tool in Alibaba E-MapReduce. I understand how to do a full data migration.

Command:

hadoop distcp -pbugpcax -m 1000 -bandwidth 30 hdfs://clusterIP:8020/user/hive/warehouse /user/hive/warehouse

What parameters do I need to add to achieve incremental synchronization in the above code?


Solution

  • To perform incremental synchronization, add the -update and -delete flags to the command:

    hadoop distcp -pbugpcax -m 1000 -bandwidth 30 -update -delete hdfs://oldclusterip:8020/user/hive/warehouse /user/hive/warehouse
    

    A little more info on both the parameters:

    -update verifies the checksum and file size of each source file against its counterpart on the target; any file that differs is re-copied from the source. If data is still being written to the old cluster during migration, re-running DistCp with -update copies only the changed files, which is what makes the synchronization incremental.

    -delete removes files from the target (new) cluster that no longer exist on the source (old) cluster, so the target does not accumulate stale data.

    I hope this helps!
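The combined effect of -update and -delete can be sketched locally with ordinary directories. This is only an illustration of the sync semantics (real DistCp runs MapReduce jobs against HDFS and compares checksums and sizes, not a byte-wise cmp; all paths below are hypothetical temp directories):

```shell
# Sketch of -update / -delete semantics using local directories.
set -e

src=$(mktemp -d)   # stands in for the old cluster's path
dst=$(mktemp -d)   # stands in for the new cluster's path

# Source has a changed file and a new file; target has a stale file.
echo "v2"  > "$src/changed.txt"
echo "new" > "$src/new.txt"
echo "v1"  > "$dst/changed.txt"
echo "old" > "$dst/stale.txt"

# "-update": copy source files whose content differs on the target
# (or which are missing there).
for f in "$src"/*; do
  name=$(basename "$f")
  if ! cmp -s "$f" "$dst/$name" 2>/dev/null; then
    cp "$f" "$dst/$name"
  fi
done

# "-delete": remove target files that no longer exist on the source.
for f in "$dst"/*; do
  name=$(basename "$f")
  [ -e "$src/$name" ] || rm "$f"
done

ls "$dst"   # target now mirrors the source
```

After the run, the target contains changed.txt (updated to v2) and new.txt, while stale.txt is gone, which mirrors what a DistCp pass with -update -delete leaves on the new cluster.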