amazon-web-services amazon-s3 emr amazon-emr

Use of S3DistCp groupBy clause


I have to copy files from one S3 bucket to another. The source bucket contains many folders, and I need to pick only one file from each folder. For example, here is a sample structure:

s3://mysrcbucket/CustomerID1/File1
s3://mysrcbucket/CustomerID1/File2
s3://mysrcbucket/CustomerID2/File1
s3://mysrcbucket/CustomerID2/File2
s3://mysrcbucket/CustomerID2/File3

I have prepared a manifest (to be used with s3-dist-cp) that lists the file I need to copy for each customer, like this:

s3://mysrcbucket/CustomerID1/File2
s3://mysrcbucket/CustomerID2/File3

Since only one file per customer needs to be copied, the file at the target should be named after the respective customer ID. Something like this:

Expected Result
s3://mytrgtbucket/CustomerID1  (this will hold the content of CustomerID1/File2)
s3://mytrgtbucket/CustomerID2  (this will hold the content of CustomerID2/File3)

I am using the --groupBy option, and I am able to create a file named after the customer ID, but it ends up inside an extra folder that is also named after the customer ID, e.g.:

Current Result
s3://mytrgtbucket/CustomerID1/CustomerID1
s3://mytrgtbucket/CustomerID2/CustomerID2

The command I used is:

s3-dist-cp --src=s3://mysrcbucket/ --dest=s3://mytrgtbucket/ --copyFromManifest --previousManifest=s3://mysrcbucket/manifest.gz --groupBy='.*(CustomerID\d)/.*'
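To see what the --groupBy pattern in this command captures, here is a minimal Python sketch. S3DistCp itself evaluates the pattern as a Java regular expression; for this particular pattern Python's `re` module should behave the same way. The helper name `group_key` is hypothetical, and the use of a whole-path match is an assumption, not something confirmed by the post.

```python
import re

# Pattern copied from the --groupBy option in the command above.
GROUP_BY = re.compile(r".*(CustomerID\d)/.*")


def group_key(path):
    """Return the text captured by the first group, or None if the path does not match."""
    # Assumption: the pattern has to match the whole path, hence fullmatch.
    match = GROUP_BY.fullmatch(path)
    return match.group(1) if match else None


# Paths taken from the manifest list in the question.
for path in (
    "s3://mysrcbucket/CustomerID1/File2",
    "s3://mysrcbucket/CustomerID2/File3",
):
    print(path, "->", group_key(path))
```

The captured text is what becomes the output file name. Note that `\d` matches exactly one digit, so a folder with a longer ID (say, a hypothetical CustomerID10) would not match this pattern; `\d+` would cover that case.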

Is there something that can be done to achieve the Expected Result instead of the Current Result?


Solution

  • I made it work by modifying the manifest file so that srcDir points at each customer's folder instead of the bucket root.

    Earlier version:

    {"path":"s3://mytrgtbucket/CustomerID1/File2.txt","srcDir":"s3://mytrgtbucket/"}
    {"path":"s3://mytrgtbucket/CustomerID2/File3.txt","srcDir":"s3://mytrgtbucket/"}
    

    Working version:

    {"path":"s3://mytrgtbucket/CustomerID1/File2.txt","srcDir":"s3://mytrgtbucket/CustomerID1/"}
    {"path":"s3://mytrgtbucket/CustomerID2/File3.txt","srcDir":"s3://mytrgtbucket/CustomerID2/"}
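    The working manifest can also be generated rather than edited by hand. Below is a minimal Python sketch, assuming the manifest is a gzip-compressed file with one JSON object per line and only the two fields shown above (a manifest written by S3DistCp itself may carry more fields). The function names and the output file name are hypothetical.

```python
import gzip
import json


def manifest_line(path):
    """Build one manifest line whose srcDir is the folder that directly contains the file."""
    # Everything up to and including the last "/" is treated as the folder.
    src_dir = path.rsplit("/", 1)[0] + "/"
    entry = {"path": path, "srcDir": src_dir}
    # Compact separators reproduce the spacing used in the manifest lines above.
    return json.dumps(entry, separators=(",", ":"))


def write_manifest(paths, out_file):
    """Write one JSON line per path into a gzip-compressed manifest file."""
    with gzip.open(out_file, "wt", encoding="utf-8") as handle:
        for path in paths:
            handle.write(manifest_line(path) + "\n")


# Paths taken from the working manifest above.
paths = [
    "s3://mytrgtbucket/CustomerID1/File2.txt",
    "s3://mytrgtbucket/CustomerID2/File3.txt",
]
for path in paths:
    print(manifest_line(path))

# To produce the file itself (hypothetical local name):
# write_manifest(paths, "manifest.gz")
```

    The resulting file would then be uploaded to S3 and passed to --previousManifest, as in the command from the question.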