java hadoop hdfs distcp apache-ranger

Using the -update option with DistCp in Java


My goal is to run DistCp through its Java API.
From the command line I am able to perform a distcp:

hadoop --config /path/to/cluster2/hadoop/conf distcp -skipcrccheck -update hdfs://clusterHA1/path/to/file hdfs://clusterHA2/path/to/target

In Java, I have trouble using the -skipcrccheck and -update options.

final DistCpOptions distcpOption = new DistCpOptions(sourceFile, destFile);
distcpOption.setSkipCRC(true);
distcpOption.setSyncFolder(true);
runExitCode = this.distCpRun(sourceFile, destFile, distcpOption);

I get this exception:

java.lang.IllegalArgumentException: Skip CRC is valid only with update options

Looking at the code, the order of the setter calls matters, so I swapped the two options:

final DistCpOptions distcpOption = new DistCpOptions(sourceFile, destFile);
distcpOption.setSyncFolder(true);
distcpOption.setSkipCRC(true);
runExitCode = this.distCpRun(sourceFile, destFile, distcpOption);

Now I get:

java.io.IOException: Check-sum mismatch between source and target

I am pretty sure that setSyncFolder sets the update option, judging by DistCpOptionSwitch:

public enum DistCpOptionSwitch {
SYNC_FOLDERS("distcp.sync.folders", new Option("update", false, "Update target, copying only missingfiles or directories")),
}
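
The order dependence described above can be illustrated with a small stand-in class. This is only a sketch, not Hadoop's DistCpOptions: the class name CopyOptionsSketch and its fields are hypothetical, and the only behavior it reproduces is the check implied by the exception message in the question, namely that skip-CRC is rejected unless the update (sync folder) flag has already been set.

```java
// Hypothetical stand-in for illustration only; this is NOT Hadoop's DistCpOptions.
final class CopyOptionsSketch {
    private boolean syncFolder; // models the -update flag
    private boolean skipCrc;    // models the -skipcrccheck flag

    void setSyncFolder(boolean syncFolder) {
        this.syncFolder = syncFolder;
    }

    void setSkipCrc(boolean skipCrc) {
        // The setter validates against the current state, so the result
        // depends on whether setSyncFolder(true) was called beforehand.
        if (skipCrc && !syncFolder) {
            throw new IllegalArgumentException(
                "Skip CRC is valid only with update options");
        }
        this.skipCrc = skipCrc;
    }

    boolean isSyncFolder() {
        return syncFolder;
    }

    boolean isSkipCrc() {
        return skipCrc;
    }
}
```

Calling setSkipCrc(true) on a fresh instance throws, while calling setSyncFolder(true) first lets the same call succeed.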

I am using Hadoop 2.6.4. The checksums differ between the two clusters because each cluster has its own Ranger KMS instance. I am sending a file from an unencrypted zone to an encrypted zone, and this works well from the command line.


Solution

  • I finally solved this problem by passing the options as string arguments to the run method, as on the command line, instead of building a DistCpOptions object.

    distCp.run(new String[] {"-skipcrccheck", "-update", source, destination});
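
If the flags need to be toggled from code, the argument array passed to run can be assembled by a small helper. The following is a minimal sketch under assumptions: the class DistCpArgsSketch and the method buildArgs are hypothetical names, not part of Hadoop, and the helper only builds the string array; it does not invoke DistCp itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Hadoop API.
final class DistCpArgsSketch {

    // Builds a command-line style argument array such as
    // {"-skipcrccheck", "-update", source, target}.
    static String[] buildArgs(boolean update, boolean skipCrcCheck,
                              String source, String target) {
        // Reject the combination that the question shows failing:
        // skipping the CRC check without the update option.
        if (skipCrcCheck && !update) {
            throw new IllegalArgumentException("-skipcrccheck requires -update");
        }
        List<String> args = new ArrayList<>();
        if (skipCrcCheck) {
            args.add("-skipcrccheck");
        }
        if (update) {
            args.add("-update");
        }
        // Source and target paths come last, as on the command line.
        args.add(source);
        args.add(target);
        return args.toArray(new String[0]);
    }
}
```

The returned array could then be handed to distCp.run(...) in place of the literal array above.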