Tags: amazon-web-services, amazon-s3, mapreduce, emr, amazon-emr

Using MapReduce to read the files within a directory


My S3 directory layout is:

/sssssss/xxxxxx/rrrrrr/xx/file1
/sssssss/xxxxxx/rrrrrr/xx/file2
/sssssss/xxxxxx/rrrrrr/xx/file3
/sssssss/xxxxxx/rrrrrr/yy/file4
/sssssss/xxxxxx/rrrrrr/yy/file5
/sssssss/xxxxxx/rrrrrr/yy/file6

How can my MapReduce program read these files from S3?


Solution

  • For one input path you do the following:

    FileInputFormat.addInputPath(job, new Path("/sssssss/xxxxxx/rrrrrr/xx/"));
    

    For two input paths, you do the following:

    FileInputFormat.addInputPath(job, new Path("/sssssss/xxxxxx/rrrrrr/xx/"));
    FileInputFormat.addInputPath(job, new Path("/sssssss/xxxxxx/rrrrrr/yy/"));
    

    Alternatively, use addInputPaths(), which takes all the paths as a single comma-separated string. See the documentation of FileInputFormat (for your version of Hadoop) for more details.
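
As a minimal sketch of the addInputPaths() variant, the helper below builds the comma-separated string from a list of directories. The class name InputPaths, the join method, and the s3://my-bucket prefix are hypothetical and not from the question. The scheme is an assumption as well: EMR typically uses s3://, while a plain Apache Hadoop setup usually uses s3a://. The actual Hadoop call is left as a comment because it needs a configured Job.

```java
import java.util.List;
import java.util.stream.Collectors;

public class InputPaths {

    // Prepends the (hypothetical) bucket prefix to each directory and joins
    // the results with commas, the form addInputPaths(Job, String) expects.
    public static String join(String prefix, List<String> dirs) {
        return dirs.stream()
                   .map(dir -> prefix + dir)
                   .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Assumption: "my-bucket" stands in for the real bucket name.
        String prefix = "s3://my-bucket";

        String paths = join(prefix, List.of(
                "/sssssss/xxxxxx/rrrrrr/xx/",
                "/sssssss/xxxxxx/rrrrrr/yy/"));

        System.out.println(paths);

        // In the job driver, with Hadoop on the classpath:
        // FileInputFormat.addInputPaths(job, paths);
    }
}
```

Each directory passed this way is read in full, so file1 through file6 would all become input to the mappers.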