Tags: hadoop, amazon-s3, apache-pig, hdfs, emr

Remove directory level when transferring from HDFS to S3 using S3DistCp


I have a Pig script (using a slightly modified MultiStorage) that transforms some data. Once the script runs, I have data in the following format on HDFS:

/tmp/data/identifier1/identifier1-0,0001
/tmp/data/identifier1/identifier1-0,0002
/tmp/data/identifier2/identifier2-0,0001
/tmp/data/identifier3/identifier3-0,0001

I'm attempting to use S3DistCp to copy these files to S3. I am using the --groupBy .*(identifier[0-9]).* option to combine files based on the identifier. The combination works, but when copying to S3, the folders are also copied. The end output is:

/s3bucket/identifier1/identifier1
/s3bucket/identifier2/identifier2
/s3bucket/identifier3/identifier3
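
For reference, the invocation I'm describing looks roughly like this (a sketch assuming the s3-dist-cp command available on EMR; the bucket name is a placeholder):

    # combine the per-identifier part files while copying to S3
    s3-dist-cp --src hdfs:///tmp/data --dest s3://s3bucket/ --groupBy '.*(identifier[0-9]).*'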

Is there a way to copy these files without that extra directory level? Ideally, my output in S3 would look like:

/s3bucket/identifier1
/s3bucket/identifier2
/s3bucket/identifier3

Another approach I've considered is to use HDFS commands to pull these files out of their directories before copying to S3. Is that a reasonable solution?
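
For example, something along these lines (a rough sketch; /tmp/flat is just a hypothetical staging directory):

    # move every per-identifier file up into a single flat directory
    hdfs dfs -mkdir -p /tmp/flat
    hdfs dfs -mv '/tmp/data/*/*' /tmp/flat/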

Thanks!


Solution

  • The solution I've arrived at is to use distcp to pull these files out of their directories before using s3distcp:

    hadoop distcp -update /tmp/data/** /tmp/grouped

    (With -update, distcp copies the contents of each matched source directory into /tmp/grouped rather than recreating the directories themselves, which is what removes the extra directory level.)


    Then, I changed the s3distcp script to move data from /tmp/grouped into my S3 bucket.
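
    The updated copy step then looks roughly like this (a sketch; s3://s3bucket is a placeholder and the exact step arguments may differ between EMR releases):

    # group the already-flattened files from /tmp/grouped as they are copied to S3
    s3-dist-cp --src hdfs:///tmp/grouped --dest s3://s3bucket/ --groupBy '.*(identifier[0-9]).*'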