How to filter S3 files as input for Amazon EMR?

I'm trying to run an Amazon EMR Hadoop job that processes CloudFront logs stored in an S3 bucket. Since CloudFront writes a lot of logs to the same bucket, how do I filter the log files without generating extra bandwidth for S3 access?


  • I found that I can use FileSystem.globStatus() to quickly filter files from the CloudFront logs bucket:

    FileSystem fs = new Path("s3://logs").getFileSystem(conf);
    // globStatus() takes a Path pattern, not a String
    for (FileStatus fileStatus : fs.globStatus(new Path("s3://logs/prefix-2015-11-01*"))) {
       if (fileStatus.isFile()) {
          FileInputFormat.addInputPath(myJob, fileStatus.getPath());