Tags: hadoop, hdfs, hadoop2, distcp

How can I execute hadoop distcp -f command properly?


I want to back up some folders and files on my Hadoop cluster. I ran this command:

hadoop distcp -p -update -f hdfs://cluster1:8020/srclist hdfs://cluster2:8020/hdpBackup/

My srclist file :

hdfs://cluster1:8020/user/user1/folder1
hdfs://cluster1:8020/user/user1/folder2
hdfs://cluster1:8020/user/user1/file1

folder1 contains two files : part-00000 and part-00001

folder2 contains two files : file and file_old

That command works, but it flattens the folders and copies only their contents.

Result :

--hdpBackup
  - part-00000
  - part-00001
  - file1
  - file
  - file_old

But I want to get result :

--hdpBackup
  - folder1
  - folder2
  - file1

I cannot use hdfs://cluster1:8020/user/user1/* because user1 contains many other folders and files.

How can I solve this problem?


Solution

  • Use the shell script below; it runs a separate distcp job for each path in the list, so each folder keeps its own name at the destination:

     #!/bin/sh
     # Read each source path from the local srclist file and copy it
     # with its own distcp job, so every folder is created under the
     # backup directory with its name preserved.
     while read -r line; do
         # Use the last path component as the destination name
         line1=$(basename "$line")

         echo "copying $line to backup1/$line1"

         hadoop distcp "$line" "hdfs://10.20.53.157/user/root/backup1/$line1"
     done < /home/Desktop/distcp/srclist
    

    The srclist file needs to be on the local file system and contains paths like:

       hdfs://10.20.53.157/user/root/Wholefileexaple_1
       hdfs://10.20.53.157/user/root/Wholefileexaple_2
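
    The destination name for each copy is just the last component of the source path. A quick sanity check of that extraction, runnable without a cluster (a minimal sketch; the URL is one of the example paths above):

    ```shell
    # basename strips everything up to the last '/', which also works
    # on full hdfs:// URIs, giving the final path component.
    line="hdfs://10.20.53.157/user/root/Wholefileexaple_1"
    line1=$(basename "$line")
    echo "$line1"   # prints Wholefileexaple_1
    ```

    This is why folder1 ends up as backup1/folder1 rather than having its part files spilled directly into the backup directory.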