
How to filter S3 files as input for Amazon EMR?


I'm trying to run an Amazon EMR Hadoop job that processes CloudFront logs stored in an S3 bucket. Since CloudFront writes a large number of log files to the same bucket, how do I filter the log files used as input without generating extra S3 bandwidth?


Solution

  • I found that I can use FileSystem.globStatus() to quickly filter files from the CloudFront logs bucket. It only lists the paths that match the pattern and does not read the files themselves:

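    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; // assumes myJob is a mapreduce Job

    // conf is the job's Configuration; myJob is the job being configured.
    FileSystem fs = new Path("s3://logs").getFileSystem(conf);
    for (FileStatus fileStatus : fs.globStatus(new Path("s3://logs/prefix-2015-11-01*"))) {
       if (fileStatus.isFile()) {
          FileInputFormat.addInputPath(myJob, fileStatus.getPath());
       }
    }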
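To check which log file names a date-prefix pattern would select before submitting a job, a small local sketch like the one below may help. It uses the JDK's own glob matcher rather than Hadoop, so it only approximates Hadoop's glob rules (the two syntaxes are similar but not identical). The class name `GlobPreview`, the method `matching`, and the sample file names are hypothetical, made up for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class GlobPreview {
    // Returns the names matched by the glob pattern, in input order.
    // Uses java.nio glob syntax, which is an approximation of Hadoop's.
    static List<String> matching(String glob, List<String> names) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> selected = new ArrayList<>();
        for (String name : names) {
            if (matcher.matches(Paths.get(name))) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Hypothetical log file names, for illustration only.
        List<String> names = List.of(
            "prefix-2015-11-01-00.aaaa.gz",
            "prefix-2015-11-01-23.bbbb.gz",
            "prefix-2015-11-02-00.cccc.gz");
        // Only the two names dated 2015-11-01 should be selected.
        System.out.println(matching("prefix-2015-11-01*", names));
    }
}
```

Because the pattern is anchored on the date prefix, widening or narrowing the input set is a matter of changing the pattern (for example `prefix-2015-11-*` for a whole month) rather than the job code.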