hadoop amazon-web-services amazon-s3 hdfs distcp

Multiple source files for s3distcp


Is there a way to copy a list of specific files from S3 to HDFS with s3distcp, instead of a complete folder? This is for cases where srcPattern cannot work.

I have multiple files in an S3 folder, all with different names. I want to copy only specific files to an HDFS directory. I did not find any way to specify multiple source file paths to s3distcp.

The workaround I am currently using is to list all the file names in srcPattern:

hadoop jar s3distcp.jar \
    --src s3n://bucket/src_folder/ \
    --dest hdfs:///test/output/ \
    --srcPattern '.*somefile.*|.*anotherone.*'

Can this work when the number of files is very large, say around 10,000?


Solution

  • Yes, you can. Create a manifest file listing all the files you need and use the --copyFromManifest option, together with --previousManifest pointing at that manifest file.
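
To make the manifest idea concrete, here is a rough sketch of building one from a plain list of file names. Treat the details as assumptions rather than a confirmed recipe: the one-JSON-object-per-line layout with path, baseName, srcDir and size fields is what I understand S3DistCp to write with --outputManifest, so compare it against a manifest produced on your own cluster first. The make_manifest helper, the files.txt input and the size placeholder of 0 are all made up for illustration, and file names are assumed to contain no quotes or backslashes.

```shell
#!/usr/bin/env bash
# Build a gzip-compressed manifest for S3DistCp from a list of file names.
# ASSUMPTION: each manifest line is a JSON object with the fields
# path, baseName, srcDir and size. Verify against a manifest written by
# --outputManifest before relying on this layout.

# make_manifest SRC_DIR LIST_FILE OUT_GZ   (hypothetical helper)
#   SRC_DIR    source prefix without a trailing slash, e.g. s3n://bucket/src_folder
#   LIST_FILE  text file with one file name per line
#   OUT_GZ     path of the gzip manifest to write
make_manifest() {
    local src_dir="$1" list_file="$2" out_gz="$3"
    local name
    # The "|| [ -n ... ]" part keeps a last line that has no trailing newline.
    while IFS= read -r name || [ -n "$name" ]; do
        if [ -z "$name" ]; then
            continue  # skip blank lines
        fi
        # size is a placeholder (0); the real object size is not looked up here.
        printf '{"path":"%s/%s","baseName":"%s","srcDir":"%s","size":0}\n' \
            "$src_dir" "$name" "$name" "$src_dir"
    done < "$list_file" | gzip -c > "$out_gz"
}
```

Once the manifest is somewhere the cluster can read it, the copy would look roughly like this (the manifest location below is hypothetical):

    make_manifest s3n://bucket/src_folder files.txt manifest.gz
    hadoop jar s3distcp.jar \
        --src s3n://bucket/src_folder/ \
        --dest hdfs:///test/output/ \
        --copyFromManifest \
        --previousManifest=s3n://bucket/manifests/manifest.gz

Unlike the srcPattern workaround, this keeps the file list out of the command line, so the number of files does not affect the length of the regex or the command.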