Calling a compiled binary on Amazon MapReduce


I'm trying to do some data analysis on Amazon Elastic MapReduce. The mapper step is a Python script that calls a compiled C++ binary named "./formatData". For example:

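# myMapper.py
import sys
from subprocess import Popen, PIPE

inputData = sys.stdin.readline()
# ...
# Pipe the input line through the compiled binary in the working directory
p1 = Popen('./formatData', stdin=PIPE, stdout=PIPE)
p1Output, _ = p1.communicate(input=inputData)  # communicate() returns (stdout, stderr)
result = ... # manipulate the formatted data
print("%s\t%s" % (result, 1))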

Can I call a binary executable like this on Amazon EMR? If so, where should I store the binary (in S3?), for what platform should I compile it, and how do I ensure my mapper script has access to it (ideally in the current working directory)?

Thanks!


Solution

  • Yes, you can call the binary that way, provided it is copied to the worker nodes correctly.

    See this thread for an explanation of how to use the distributed cache to make binary files accessible on the worker nodes:

    https://forums.aws.amazon.com/thread.jspa?threadID=35158
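
As a rough illustration of the mapper side, here is a minimal sketch. It assumes the binary has already been shipped through the distributed cache (for instance with Hadoop Streaming's -cacheFile option and a "#formatData" symlink suffix) so that it appears as ./formatData in the task's working directory. The helper names ensure_executable, run_filter and main are hypothetical, not from the question, and the chmod step is only a precaution in case the execute bit did not survive the copy.

```python
import os
import stat
import subprocess
import sys


def ensure_executable(path):
    """Add the owner-execute bit if the file at `path` lacks it."""
    mode = os.stat(path).st_mode
    if not mode & stat.S_IXUSR:
        os.chmod(path, mode | stat.S_IXUSR)


def run_filter(argv, data):
    """Send `data` (bytes) to the command `argv` on stdin; return its stdout."""
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = proc.communicate(input=data)
    if proc.returncode != 0:
        # Fail loudly so the task is marked as failed instead of emitting bad rows
        raise RuntimeError("%s exited with status %d" % (argv[0], proc.returncode))
    return out


def main(binary="./formatData"):
    # Assumption: the distributed cache placed the binary in the working directory
    ensure_executable(binary)
    # Read raw bytes on both Python 2 and Python 3
    stdin = getattr(sys.stdin, "buffer", sys.stdin)
    formatted = run_filter([binary], stdin.read())
    return formatted  # further manipulation and key/value output would go here
```

Calling main() at the bottom of the mapper script would run the binary once over the task's whole input. Checking the exit status is a deliberate choice here: a binary compiled for the wrong platform fails at launch, and raising makes that visible in the task logs.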