Search code examples
hadoopamazon-s3amazon-emrdistcps3distcp

Using GroupBy while copying from HDFS to S3 to merge files within a folder


I have the following folders in HDFS :

hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/AE/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/BH/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/IN/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/IN/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/KW/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/KW/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/ME/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/OM/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/Others/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/QA/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/QA/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/SA/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/SA/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/AE/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/AE/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/BH/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/BH/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/IN/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/IN/INT/20171001/2017100101

Each folder has close to 50 files in it.My intention is to merge all the files within a folder to get a single file while copying it on S3 from HDFS. The issue I am having is with the regex with the groupBy option.I tried this and this does not seem to work :

s3-dist-cp --src hdfs:///Air/ --dest s3a://HadoopSplit/Air-merged/  --groupBy '.*/(\w+)/(\w+)/(\w+)/.*' --outputCodec lzo

The command works per se but i don't get the files within each folder merged into a single file which makes me believe that the issue is with my regex.


Solution

  • i figured this out myself only..the correct regex is

    .*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.*
    

    and the command to merge and copy is :

    s3-dist-cp --src hdfs:///Air/ --dest s3a://HadoopSplit/Air-merged/  --groupBy '.*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.*' --outputCodec lzo