Tags: linux, hadoop, hdfs, rsync

Copy data from one HDFS directory to another continuously


I have a directory in HDFS that gets new files every two days. I want to copy all the files in this directory to another one, such that if a new file arrives today, it is also copied to the duplicate directory.

How can we do that in HDFS?

I know we can do that in Linux using rsync. Is there a similar method in HDFS?


Solution

  • No, there is no built-in file-sync mechanism in HDFS. You have to run hdfs dfs -cp or hadoop distcp either manually or through a scheduler such as cron.

    If the number of files is large, distcp is preferred, since it copies files in parallel using a MapReduce job.

    hadoop distcp -update <src_dir> <dest_dir>
    

    The -update flag copies only files that are missing at the destination, and overwrites those whose source and destination copies differ in size, block size, or checksum. Repeated runs therefore transfer only what has changed.
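
    To make the copy "continuous", the distcp command above can be scheduled with cron. A minimal sketch of a crontab entry follows; the paths /data/incoming and /data/mirror, the 01:00 every-other-day schedule, and the log location are all hypothetical placeholders to adapt to your cluster:

    ```
    # Hypothetical crontab entry: run an incremental distcp every 2 days at 01:00.
    # -update makes the run incremental, so unchanged files are skipped.
    0 1 */2 * * hadoop distcp -update hdfs:///data/incoming hdfs:///data/mirror >> /var/log/hdfs_mirror.log 2>&1
    ```

    Run the entry as a user with write access to the destination directory; since -update skips files that already match, an overlapping or extra run is harmless.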