Tags: amazon-s3, mapreduce, hdfs, bigdata, emr

How does map-reduce work on HDFS vs S3?


I have been trying to understand how a map-reduce job is executed differently on HDFS vs S3. Can someone please address my questions:

Typically, HDFS clusters are not only storage-oriented but also provide the compute horsepower to execute MR jobs; that is why jobs are mapped on many data nodes and reduced on a few. To be exact, the mapping (filtering etc.) is done locally on the nodes that hold the data, whereas the reducing (aggregation) is done on a common node.
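To make concrete what I mean by map (filter) and reduce (aggregation), here is a minimal word-count-style sketch using the standard Hadoop MapReduce API (class and variable names are just illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map side: runs on the node holding the input split, emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce side: runs on a (usually smaller) set of nodes, aggregates the counts.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
```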

Does this approach work as-is on S3? As far as I understand, S3 is just a data store. Does Hadoop have to COPY the WHOLE data set from S3 and then run map (filter) and reduce (aggregation) locally, or does it follow exactly the same approach as with HDFS? If the former is true, running jobs on S3 could be slower than running jobs on HDFS (due to the copying overhead).

Please share your thoughts.


Solution

  • Performance of S3 is slower than HDFS, but it provides other features like bucket versioning, elasticity, and other data-recovery schemes (Netflix runs Hadoop clusters backed by S3).

    Theoretically, before splits can be computed, the sizes of the input files need to be determined. To that end, Hadoop itself has a filesystem implementation on top of S3, which allows the higher layers to be agnostic of where the data comes from. Map-Reduce calls the generic file-listing API against each input directory to get the sizes of all files in that directory (see the first sketch below).

    Amazon's EMR has a special version of the S3 filesystem that can stream data directly to S3 instead of buffering to intermediate local files; this can make jobs faster on EMR (the second sketch below shows how a job is simply pointed at S3 paths).
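As an illustration of that listing step, here is a minimal sketch using the same FileSystem abstraction MapReduce relies on for split planning, pointed at an S3 path. It assumes the s3a connector is on the classpath and credentials are already configured; the bucket and prefix are placeholders:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListS3Input {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The same FileSystem abstraction resolves hdfs://, s3a://, file:// etc.,
    // so the layers above don't care where the bytes live.
    Path inputDir = new Path("s3a://my-bucket/input/");  // placeholder bucket/prefix
    FileSystem fs = FileSystem.get(URI.create(inputDir.toString()), conf);

    // Essentially what the input format does before computing splits:
    // list the directory and read each file's length.
    for (FileStatus status : fs.listStatus(inputDir)) {
      System.out.printf("%s  %d bytes%n", status.getPath(), status.getLen());
    }
  }
}
```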
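And a hedged sketch of a job driver that reads from and writes to S3 directly; whether the output is streamed straight to S3 (as with EMR's filesystem) or staged locally first is handled below this layer. It reuses the mapper/reducer from the word-count sketch above; class names and bucket paths are placeholders. On EMR the "s3://" scheme maps to EMR's S3 filesystem, while plain Apache Hadoop would typically use "s3a://":

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3JobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount-on-s3");
    job.setJarByClass(S3JobDriver.class);

    // Reuse the mapper/reducer from the earlier word-count sketch.
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input and output live on S3; the job code is otherwise unchanged.
    FileInputFormat.addInputPath(job, new Path("s3://my-bucket/input/"));    // placeholder
    FileOutputFormat.setOutputPath(job, new Path("s3://my-bucket/output/")); // placeholder

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```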