AWS Spark EMR Numpy Import Error


I'm trying to submit a Python script that imports numpy on AWS EMR, but I get:

ImportError: No module named numpy 

I tried one of the answers to "No module named numpy when spark-submitting": I created a bootstrap_actions.sh script that includes

 sudo yum install python-numpy python-scipy -y

and I run the script when I create the cluster, but I still get the import error. How can I get import numpy to work?


Solution

  • For Amazon EMR you need to use bootstrap actions, which run on every node when the cluster starts. Installing a package from a console session on the master node changes only the master node, not the task nodes.

    # mrjob.conf syntax; each bootstrap entry runs on every node at startup
    runners:
      emr:
        bootstrap:
        - sudo yum install -y python27-numpy
    

    I am assuming that you are using Python 2.7. If you are using Python 3.x, the link below has examples of installing packages with pip in the bootstrap. I am also assuming that you are using a recent EMR AMI.

    EMR Bootstrapping Cookbook
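
If the import still fails after bootstrapping, it may help to check which Python interpreter each node actually runs and whether numpy is importable there, since a package installed for one interpreter is not visible to another. The following is only a rough diagnostic sketch, not something from the linked cookbook: `probe_numpy` and `report` are hypothetical helper names, and it assumes you already have a SparkContext called `sc` (for example in a script run with spark-submit or in the pyspark shell).

```python
import socket
import sys


def probe_numpy(_=None):
    """Return (hostname, interpreter path, numpy version or None) for this process."""
    try:
        import numpy
        version = numpy.__version__
    except ImportError:
        # numpy is not installed for the interpreter running this process
        version = None
    return (socket.gethostname(), sys.executable, version)


def report(sc, tasks=32):
    """Print the probe result for the driver and for the executors that ran a task.

    `sc` is an existing SparkContext (assumed). `tasks` is an arbitrary number of
    small tasks; raise it if some nodes do not show up in the output.
    """
    print("driver: %s" % (probe_numpy(),))
    # One tiny task per partition; each task runs probe_numpy on some executor.
    rows = set(sc.parallelize(range(tasks), tasks).map(probe_numpy).collect())
    for row in sorted(rows, key=str):
        print("executor: %s" % (row,))
```

Calling `report(sc)` prints one line per distinct host and interpreter. A row whose last field is `None` points at a node (or a Python version) that the bootstrap action did not install numpy for.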