I have a Pig script (using a slightly modified MultiStorage) that transforms some data. Once the script runs, I have data in the following format on HDFS:
/tmp/data/identifier1/identifier1-0,0001
/tmp/data/identifier1/identifier1-0,0002
/tmp/data/identifier2/identifier2-0,0001
/tmp/data/identifier3/identifier3-0,0001
I'm attempting to use S3DistCp to copy these files to S3. I am using the --groupBy .*(identifier[0-9]).* option to combine files based on the identifier. The combination works, but when copying to S3, the folders are also copied. The end output is:
/s3bucket/identifier1/identifier1
/s3bucket/identifier2/identifier2
/s3bucket/identifier3/identifier3
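For reference, the S3DistCp step I'm running looks roughly like the following; the bucket name and paths are placeholders for my actual values, and the exact invocation depends on how S3DistCp is launched on the cluster:

s3-dist-cp --src hdfs:///tmp/data --dest s3://s3bucket/ --groupBy '.*(identifier[0-9]).*'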
Is there a way to copy these files without that first folder? Ideally, my output in S3 would look like:
/s3bucket/identifier1
/s3bucket/identifier2
/s3bucket/identifier3
Another solution I've considered is to use HDFS commands to pull those files out of their directories before copying to S3. Is that a reasonable solution?
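Roughly, I imagine something like the following, where /tmp/flat is just an example target directory:

hadoop fs -mkdir /tmp/flat
hadoop fs -mv /tmp/data/*/* /tmp/flat/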
Thanks!
The solution I've arrived upon is to use distcp to bring these files out of the directories before using s3distcp:
hadoop distcp -update /tmp/data/** /tmp/grouped
Then, I changed the s3distcp script to move data from /tmp/grouped into my S3 bucket.
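The adjusted s3distcp step ends up looking roughly like this (again, the bucket name is a placeholder and the invocation style depends on your EMR setup); the --groupBy option is still needed so the part files for each identifier are merged into a single output file:

s3-dist-cp --src hdfs:///tmp/grouped --dest s3://s3bucket/ --groupBy '.*(identifier[0-9]).*'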