Tags: hadoop, mapreduce, hadoop-streaming, elastic-map-reduce

Splitting responsibilities of mappers on Elastic MapReduce (MySQL + MongoDB input)


I want to make sure I understand EMR correctly. I'm wondering - does what I'm talking about make any sense with EMR / Hadoop?

I currently have a recommendation engine on my app that examines data stored in both MySQL and MongoDB (both on separate EC2 instances) and, as a result, can suggest content to users. This has worked fine, but we're now at a point where the script takes longer to execute than the interval at which it is supposed to run. This is obviously a problem.

I'm considering moving this script to EMR. I understand that I will be able to connect to MongoDB and MySQL from my mapping script (i.e. the input doesn't need to be a file on S3). What I'm wondering is: if I start examining the data in MySQL / S3, does Hadoop have some method of making sure that the script doesn't examine the same records on each instance? Do I even understand the concept of Hadoop at all? Sorry if this question is really noob.


Solution

  • Yes, Hadoop does make sure that the input records from the DB are split first and only then passed to the mappers, i.e. the same records will not be read by different mappers (even when they run on the same instance).

    Generally speaking, the task of splitting the data is up to the chosen InputFormat. To quote from here:

    Another important job of the InputFormat is to divide the input data sources (e.g., input files) into fragments that make up the inputs to individual map tasks. These fragments are called "splits" and are encapsulated in instances of the InputSplit interface. Most files, for example, are split up on the boundaries of the underlying blocks in HDFS, and are represented by instances of the FileInputSplit class. Other files may be unsplittable, depending on application-specific data. Dividing up other data sources (e.g., tables from a database) into splits would be performed in a different, application-specific fashion. When dividing the data into input splits, it is important that this process be quick and cheap. The data itself should not need to be accessed to perform this process (as it is all done by a single machine at the start of the MapReduce job).

    You might have already read this, but this is a nice initial resource on DBInputFormat for Hadoop (see the sketch below).
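
    To make the splitting behaviour concrete, here is a minimal sketch of a map-only job that reads a MySQL table through DBInputFormat using the newer org.apache.hadoop.mapreduce API. The table name (users), columns (id, name), JDBC host, credentials, and the MyRecord/MyMapper class names are all hypothetical placeholders for your own schema and logic, not anything from your setup:

    ```java
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DBInputExample {

        // One row of the hypothetical "users" table.
        public static class MyRecord implements Writable, DBWritable {
            long id;
            String name;

            public void readFields(ResultSet rs) throws SQLException {
                id = rs.getLong("id");
                name = rs.getString("name");
            }
            public void write(PreparedStatement st) throws SQLException {
                st.setLong(1, id);
                st.setString(2, name);
            }
            public void readFields(DataInput in) throws IOException {
                id = in.readLong();
                name = in.readUTF();
            }
            public void write(DataOutput out) throws IOException {
                out.writeLong(id);
                out.writeUTF(name);
            }
        }

        // Each mapper only sees the rows that belong to its own InputSplit.
        public static class MyMapper
                extends Mapper<LongWritable, MyRecord, LongWritable, Text> {
            protected void map(LongWritable key, MyRecord value, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(new LongWritable(value.id), new Text(value.name));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // JDBC connection details; host and credentials are placeholders.
            DBConfiguration.configureDB(conf,
                    "com.mysql.jdbc.Driver",
                    "jdbc:mysql://your-mysql-host/yourdb",
                    "user", "password");

            Job job = Job.getInstance(conf, "db-input-example");
            job.setJarByClass(DBInputExample.class);
            job.setMapperClass(MyMapper.class);
            job.setNumReduceTasks(0);                    // map-only job
            job.setInputFormatClass(DBInputFormat.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // DBInputFormat counts the rows, divides them into one range per
            // map task, and each record reader pages through its own range,
            // so no two mappers read the same records. An ORDER BY column is
            // needed so the paging is stable.
            DBInputFormat.setInput(job, MyRecord.class,
                    "users", null /* conditions */, "id" /* orderBy */,
                    "id", "name");

            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
    ```

    The key point for your question is in DBInputFormat.setInput: the splitting into non-overlapping row ranges is done for you by the InputFormat, so your mapper logic never has to coordinate which instance reads which records.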