I'm trying to run an Amazon EMR Hadoop job that processes CloudFront logs stored in an S3 bucket. Since CloudFront writes a large number of log files to the same bucket, how do I filter the log files without generating extra bandwidth for S3 access?
I found that I can use FileSystem.globStatus() to quickly filter files from the CloudFront logs bucket:
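// conf is the job's Configuration and myJob is the Job being set up.
FileSystem fs = new Path("s3://logs").getFileSystem(conf);
// globStatus() takes a Path, not a String.
for (FileStatus fileStatus : fs.globStatus(new Path("s3://logs/prefix-2015-11-01*"))) {
    if (fileStatus.isFile()) {
        FileInputFormat.addInputPath(myJob, fileStatus.getPath());
    }
}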
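To check which file names a pattern such as prefix-2015-11-01* would select before submitting the job, here is a minimal stand-alone sketch. It uses the JDK's own glob matcher rather than Hadoop, so it is only an approximation: Hadoop's glob syntax is similar but not guaranteed to be identical. The class name GlobPreview, the filter() helper, and the sample file names are all hypothetical and exist only for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: previews which names a glob selects, using the JDK matcher.
final class GlobPreview {

    // Returns the names that match the glob, in their original order.
    static List<String> filter(List<String> names, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matched = new ArrayList<>();
        for (String name : names) {
            if (matcher.matches(Paths.get(name))) {
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Made-up names, not real CloudFront log keys.
        List<String> names = Arrays.asList(
            "prefix-2015-11-01-00.abc.gz",
            "prefix-2015-11-01-23.def.gz",
            "prefix-2015-11-02-00.ghi.gz");
        for (String name : filter(names, "prefix-2015-11-01*")) {
            System.out.println(name);
        }
    }
}
```

Only the two names for 2015-11-01 are selected here; the 2015-11-02 name is skipped, which is the same day-level filtering the globStatus() call above is meant to do against the bucket.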