amazon-web-services amazon-s3 mapreduce emr amazon-emr

Using MapReduce to read the files within a directory

My S3 directory contains the following files:

/sssssss/xxxxxx/rrrrrr/xx/file1
/sssssss/xxxxxx/rrrrrr/xx/file2
/sssssss/xxxxxx/rrrrrr/xx/file3
/sssssss/xxxxxx/rrrrrr/yy/file4
/sssssss/xxxxxx/rrrrrr/yy/file5
/sssssss/xxxxxx/rrrrrr/yy/file6

How can my MapReduce program read these files from S3?

Solution

For one input path, you do the following:

FileInputFormat.addInputPath(job, new Path("/sssssss/xxxxxx/rrrrrr/xx/"));

For two input paths, you do the following:

FileInputFormat.addInputPath(job, new Path("/sssssss/xxxxxx/rrrrrr/xx/"));
FileInputFormat.addInputPath(job, new Path("/sssssss/xxxxxx/rrrrrr/yy/"));

Alternatively, use addInputPaths(), which takes all the paths as a single comma-separated string. See the documentation of FileInputFormat for your version of Hadoop for more details.
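
Since addInputPaths() expects one comma-separated string, here is a minimal sketch of building that string from a base directory and its subdirectory names. The class name InputPathsSketch and the helper joinInputPaths are hypothetical, made up for illustration. The Hadoop call itself appears only as a comment because it needs a configured Job and the Hadoop libraries on the classpath.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class InputPathsSketch {

    // Hypothetical helper: joins a base directory and subdirectory names into
    // the comma-separated form that FileInputFormat.addInputPaths(job, String) takes.
    static String joinInputPaths(String baseDir, List<String> subDirs) {
        String base = baseDir.endsWith("/") ? baseDir : baseDir + "/";
        return subDirs.stream()
                .map(sub -> base + sub)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        String paths = joinInputPaths("/sssssss/xxxxxx/rrrrrr", Arrays.asList("xx", "yy"));
        System.out.println(paths);
        // In the job driver (requires Hadoop on the classpath):
        // FileInputFormat.addInputPaths(job, paths);
    }
}
```

With many subdirectories this keeps the driver to one call instead of one addInputPath() line per directory.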