
File not caching on AWS Elastic MapReduce


I'm running the following MapReduce job on AWS Elastic MapReduce:

./elastic-mapreduce --create --stream --name CLI_FLOW_LARGE --mapper s3://classify.mysite.com/mapper.py --reducer s3://classify.mysite.com/reducer.py --input s3n://classify.mysite.com/s3_list.txt --output s3://classify.mysite.com/dat_output4/ --cache s3n://classify.mysite.com/classifier.py#classifier.py --cache-archive s3n://classify.mysite.com/policies.tar.gz#policies --bootstrap-action s3://classify.mysite.com/bootstrap.sh --enable-debugging --master-instance-type m1.large --slave-instance-type m1.large --instance-type m1.large

For some reason the cache file classifier.py does not seem to be cached. I get this error when reducer.py tries to import it:

  File "/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201204290242_0001/attempt_201204290242_0001_r_000000_0/work/./reducer.py", line 12, in <module>
    from classifier import text_from_html, train_classifiers
ImportError: No module named classifier

classifier.py is most definitely present at s3n://classify.mysite.com/classifier.py. For what it's worth, the policies archive loads just fine.
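One sanity check I can add (a debugging sketch, not part of the job itself; it writes to stderr so the output lands in the task attempt logs) is to list the reducer's working directory at startup and see whether the classifier.py symlink is actually there:

    import os
    import sys

    # Debugging sketch: dump the task's working directory to stderr,
    # which ends up in the task attempt logs. If --cache worked, a
    # classifier.py symlink should appear in this listing.
    sys.stderr.write("cwd: %s\n" % os.getcwd())
    for name in sorted(os.listdir(".")):
        sys.stderr.write("  %s\n" % name)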


Solution

  • I don't know how to fix this problem in EC2, but I've seen it before with Python in traditional Hadoop deployments. Hopefully the lesson translates over.

What we need to do is add the directory reducer.py lives in to the Python path, because presumably the cached classifier.py ends up there too. For whatever reason, that directory is not on the Python path, so the import fails to find classifier.

    import os.path
    import sys

    # __file__ is the path Hadoop used to launch reducer.py;
    # dirname strips the file name, leaving only the directory.
    # abspath guards against __file__ being a bare file name,
    # in which case dirname would return an empty string.
    # Appending that directory to sys.path (the module search
    # path) lets Python find classifier.py next to reducer.py.
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))

    from classifier import text_from_html, train_classifiers
    

    The reason your code might work locally is the current working directory you run it from: when you launch reducer.py from the directory that also contains classifier.py, Python finds the module, but Hadoop does not necessarily run the task from that same working directory.
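
    A quick way to see the difference is a small sketch (the file name and exact output are illustrative, not anything EMR-specific) that prints where the process is running from versus where the script itself lives:

    import os
    import sys

    # The working directory is wherever the process was launched from;
    # the script directory is where the .py file itself resides.
    # Under Hadoop streaming these need not match your local setup,
    # which is why an import that works locally can fail on the cluster.
    sys.stderr.write("cwd:        %s\n" % os.getcwd())
    sys.stderr.write("script dir: %s\n" % os.path.dirname(os.path.abspath(__file__)))

    Run it from a few different directories locally and you'll see cwd change while the script directory stays fixed; sys.path-based imports only ever see the latter once you append it explicitly.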