python, hadoop, amazon-web-services, emr

Import a custom function in MapReduce code on AWS EMR


I have been struggling with this for 2 hours now!

I created a mapper script in Python which imports one of my custom functions from another Python script.

    #!/usr/bin/env python
    import sys

    import testImport

    # Each input line is expected to be tab-separated: <key>\t<age>
    for line in sys.stdin:
        if line and line != '':
            words = line.strip().lower().split('\t')
            # Python 2 print statement; emits "<key>\t<age class>"
            print '%s\t%s' % (words[0].strip(), testImport.age_classify(int(words[1])))

This code works well on my terminal. The problem is when I upload this mapper to AWS Elastic MapReduce: the job fails with the error "Failed to import module testImport".

testImport is a file 'testImport.py' which contains some of my helper functions (like the age_classify function) that I need to apply to each line of standard input.

I uploaded that script to the same S3 bucket as my mapper script (the one shown above).

I tried passing it in the arguments section when adding the 'Streaming program' step. I still have no clue what to do, even after reading all the related questions.

How can I get this done?

Any help would be really great!

Thank you!


Solution

  • As you said, you uploaded testImport.py to the same bucket as your map/reduce scripts, but EMR will not pull that file onto the cluster nodes unless you explicitly tell the step to ship it (one way to do this is sketched below).

    For Java we create one fat JAR containing all related classes and execute that single JAR. For your Python job, try the same idea: inline the helper functions so that you have a single, self-contained mapper script and a single reducer script, and run those (see the sketch at the end).
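
    If you prefer to keep testImport.py as a separate file, a hedged sketch of the first option is to add Hadoop streaming's generic -files option to the step's extra arguments so the file is copied into each task's working directory; the bucket and path below are placeholders for your own:

        -files s3://your-bucket/testImport.py

    With the file shipped this way it lands next to the mapper in the task's working directory, so the existing `import testImport` should resolve.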
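
    Following the single-script suggestion, here is a minimal sketch of the mapper with the helper inlined; the age brackets in age_classify are made up, since the real function from testImport.py is not shown in the question:

        #!/usr/bin/env python
        import sys

        def age_classify(age):
            # Placeholder logic: the real age_classify from testImport.py
            # is not shown in the question, so these brackets are hypothetical.
            if age < 18:
                return 'minor'
            elif age < 65:
                return 'adult'
            return 'senior'

        # Same loop as the original mapper, now with no external import needed.
        for line in sys.stdin:
            if line and line != '':
                words = line.strip().lower().split('\t')
                print '%s\t%s' % (words[0].strip(), age_classify(int(words[1])))

    With everything inlined there is nothing extra to upload or configure beyond the mapper and reducer scripts themselves.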